Memcached and redis, as the most commonly used cache servers in recent years, I believe everyone is familiar with them. When I was in school two years ago, I once read their main source code. Now I am writing a note to briefly compare their implementation methods from a personal perspective. I can just use it as a review. If there are any mistakes in my understanding, I welcome corrections.
Most of the architecture pictures used in this article come from the Internet. Some of the pictures are different from the latest implementation, which has been pointed out in the article.
To read the source code of a software, you must first understand what the software is used for. What are memcached and redis used for? As we all know, data is generally placed in a database, but querying the data is relatively slow. Especially when there are many users, frequent queries take a lot of time. How to do it? Where is the data placed so that it can be queried quickly? That must be in memory. Memcached and redis store data in memory and query it in a key-value manner, which can greatly improve efficiency. Therefore, they are generally used as cache servers to cache commonly used data. When queries are needed, they are obtained directly from them, which reduces the number of database queries and improves query efficiency.
How do memcached and redis provide services? They are independent processes. If necessary, they can be turned into daemon processes. Therefore, if our user process wants to use the memcached and redis services, inter-process communication is required. Considering that the user process, memcached and redis are not necessarily on the same machine, inter-network communication needs to be supported. Therefore, memcached and redis themselves are network servers, and user processes transmit data through the network with them. Obviously, the simplest and most commonly used one is to use tcp connection. In addition, both memcached and redis support udp protocol. And when the user process is on the same machine as memcached and redis, unix domain socket communication can also be used.
Let’s start with how they implement it. First, let’s take a look at their event models.
Since epoll came out, almost all network servers have abandoned select and poll and replaced them with epoll. The same is true for redis, except that it also provides support for select and poll. You can configure which one to use, but epoll is generally used. In addition, for BSD, the use of kqueue is also supported. Memcached is based on libevent, but the bottom layer of libevent also uses epoll, so it can be considered that they all use epoll. The characteristics of epoll will not be introduced here. There are many introduction articles online.
They all use epoll for event loop, but redis is a single-threaded server (redis is also multi-threaded, but except for the main thread, other threads do not have event loops, but only perform some background storage work), while memcached is multi-threaded. of. The event model of redis is very simple, with only one event loop and a simple reactor implementation. However, there is a bright spot in the redis event model. We know that epoll is for fd, and the ready event it returns is only fd. The fd in redis is the fd of the socket connecting the server and the client, but when processing, it needs to be based on this fd How to find specific client information? The usual processing method is to use a red-black tree to save fd and client information, and search through fd, and the efficiency is lgn. However, redis is special. The upper limit of the number of redis clients can be set, that is, the upper limit of the fd opened by redis can be known at the same time. And we know that the fd of the process will not be repeated at the same time (the fd can only be restored after it is closed). (), so redis uses an array and uses fd as the subscript of the array. The elements of the array are the client information. In this way, the client information can be located directly through fd. The search efficiency is O(1), which also eliminates complexity. The implementation of the red-black tree (I once wrote a network server in C, because I wanted to maintain the corresponding relationship between fd and connect and did not want to write the red-black tree myself, so I used the set in STL, which caused the project to become c++. In the end The project is compiled using g++, who knows if I don’t tell you?) Obviously, this method can only be used for network servers where the upper limit of the number of connections has been determined and is not too large. HTTP servers like nginx are not suitable. nginx just writes its own red-black tree.
Memcached is multi-threaded, using the master-worker method. The main thread listens to the port, establishes a connection, and then assigns it to each worker thread sequentially. Each slave thread has an event loop, which serves different clients. Pipeline communication is used between the master thread and the worker thread. Each worker thread will create a pipe, then save the write end and the read end, and add the read end to the event loop to listen for readable events. At the same time, each slave thread has a ready connection queue. After the main thread connects, it puts the connected item into this queue, and then writes a connect command to the write end of the thread's pipe, so that the pipe added in the event loop The reader will be ready, read the command from the thread, parse the command and find that there is a connection, and then go to its own ready queue to get the connection and process it. The advantage of multi-threading is that it can give full play to the advantages of multi-core, but writing a program is a bit more troublesome. Memcached has various locks and condition variables for thread synchronization.
The core tasks of memcached and redis are to operate data in memory, and memory management is naturally the core content.
First look at how they allocate memory. Memcached has its own memory pool, which means it allocates a large block of memory in advance, and then allocates the memory from the memory pool. This can reduce the number of memory allocations and improve efficiency. This is also the way most network servers are implemented. However, the management methods of each memory pool vary according to specific circumstances. Redis does not have its own memory pool, but allocates it directly when used, that is, allocates it when needed. The memory management is left to the kernel, and it is only responsible for fetching and releasing (redis is single-threaded and does not have its own memory pool. Does it feel like the implementation is too simple? That’s because it focuses on the database module). However, redis supports using tcmalloc to replace glibc's malloc. The former is a Google product and is faster than glibc's malloc.
Since redis does not have its own memory pool, the management of memory application and release is much simpler. Just malloc and free can be done directly, which is very convenient. Memcached supports the memory pool, so the memory application is obtained from the memory pool, and free is also returned to the memory pool, so it requires a lot of additional management operations and is very troublesome to implement. The details will be explained later in the slab mechanism of memcached. analyze.
Next, let’s take a look at their core content, the implementation of their respective databases.
Memcached only supports key-value, that is, only one key can correspond to one value. Its data is also stored in key-value pairs in memory, and it uses the slab mechanism.
First, let’s look at how memcached stores data, that is, it stores key-value pairs. As shown in the figure below, each key-value pair is stored in an item structure, including related attributes and key and value values.
Items store key-value pairs. When there are many items, how to find specific items is a problem. So memcached maintains a hash table, which is used to quickly find items. The hash table applies the open chain method (the same as redis) to resolve key conflicts. Each hash table bucket stores a linked list. The linked list node is the pointer of the item. As shown in the figure above, h_next refers to the next node of the linked list in the bucket. . The hash table supports expansion (expansion when the number of items is more than 1.5 of the number of buckets). There is a primary_hashtable and an old_hashtable. The primary_hashtable is normally used, but when expanding, old_hashtable = primary_hashtable is set, and then the primary_hashtable is set to the newly applied one. hash table (multiply the number of buckets by 2), and then move the data in the old_hashtable to the new hash table in sequence, and use a variable expand_bucket to record and how many buckets have been moved. After the move is completed, free the original old_hashtable ( Redis also has two hash tables, which are also moved, but it is not done by a background thread, but moves one bucket at a time). The expansion operation is completed by a background expansion thread. When expansion is needed, condition variables are used to notify it. After the expansion is completed, it blocks the condition variable waiting for expansion. In this way, when expanding, an item may be found in either primary_hashtable or old_hashtable. You need to compare the position of its bucket and the size of expand_bucket to determine which table it is in.
Where is item allocated from? from slab. As shown in the figure below, memcached has many slabclasses, which manage slabs. Each slab is actually a collection of trunks. The real items are allocated in the trunk, and each trunk is allocated an item. The size of the trunk in a slab is the same. For different slabs, the size of the trunk increases proportionally. When you need to apply for a new item, select the trunk according to its size. The rule is the smallest trunk that is larger than it. In this way, items of different sizes are allocated in different slabs and managed by different slabclasses. The disadvantage of this is that some memory will be wasted, because a trunk may be larger than the item, as shown in Figure 2. When allocating a 100B item, select the 112 trunk, but there will be a waste of 12B, and this part of the memory resource is not used.
As shown in the picture above, the entire structure is like this. slabclass manages slabs. A slabclass has a slab_list, which can manage multiple slabs. The trunk sizes of slabs in the same slabclass are all the same. The slabclass has a pointer slot, which saves the unallocated items that have been freed (not really free memory, just no longer used). When there are items that are not used, they are put into the head of the slot, so that every time they need to be When allocating an item in the current slab, you can directly access the slot, regardless of whether the item is unallocated or released.
Then, each slabclass corresponds to a linked list, with a head array and a tail array, which store the head node and tail node of the linked list respectively. The nodes in the linked list are the items allocated by the changed slabclass. The newly allocated items are placed at the head. The items further back in the linked list mean that they have not been used for a long time. When the slabclass has insufficient memory and needs to delete some expired items, it can be deleted from the end of the linked list. Yes, this linked list is designed to implement LRU. It's not enough to rely on it alone, because the query of the linked list is O(n), so when locating the item, use the hash table. This is already available. All allocated items are already in the hash table, so the hash is used to find the item. , and then the linked list can store the most recently used order of items, which is also the standard implementation method of lru.
Every time you need to allocate a new item, find the linked list corresponding to the slabclass and search from the end to the front to see if the item has expired. If it has expired, use the expired item directly as the new item. If it is not expired, you need to allocate the trunk from the slab. If the slab is used up, you need to add a slab to the slabclass.
Memcached supports setting the expiration time, that is, expire time, but it does not regularly check whether the data has expired internally. Instead, when the client process uses the data, memcached will check the expire time, and if it expires, it will directly return an error. The advantage of this is that no additional CPU is needed to check the expire time. The disadvantage is that the expired data may not be used for a long time and will not be released and occupy memory.
Memcached is multi-threaded and only maintains one database, so there may be multiple client processes operating on the same data, which may cause problems. For example, A has changed the data, and then B also changed the data, then A's operation will be overwritten, and A may not know that the current state of A's task data is the value after he changed it, so it may cause problems. In order to solve this problem, memcached uses the CAS protocol. Simply put, the item saves a 64-bit unsigned int value to mark the version of the data. Every time it is updated (the data value is modified), the version number increases, and then the data is changed each time. To operate, you need to compare whether the version number sent by the client process is consistent with the version number of the item on the server side. If they are consistent, you can perform a change operation, otherwise it will prompt dirty data.
The above is an introduction to how memcached implements a key-value database.
First of all, the redis database is more powerful, because unlike memcached, which only supports saving strings, redis supports five data structures: string, list, set, sorted set, and hash table. For example, to store a person's information, you can use a hash table, use the person's name as the key, and then name super, age 24. Through key and name, you can get the name super, or through key and age, you can get the age 24. In this way, when you only need to obtain the age, you don't need to retrieve the entire information of the person, and then look for the age from inside, and just obtain the age directly, which is efficient and convenient.
In order to implement these data structures, redis defines the abstract object redis object, as shown below. Each object has a type, a total of 5 types: string, linked list, set, ordered set, hash table. At the same time, in order to improve efficiency, redis prepares multiple implementation methods for each type, and chooses the appropriate implementation method according to specific scenarios. Encoding represents the implementation method of the object. Then the LRU of the object is recorded, which is the last time it was accessed. At the same time, a current time is recorded in the redis server (approximate value, because this time is only updated at certain intervals when the server performs automatic maintenance). The difference between them is enough to calculate how long the object has not been accessed. Then there is also reference counting in the redis object, which is used to share the object and then determine the deletion time of the object. Finally, a void* pointer is used to point to the real content of the object. Officially, due to the use of abstract redis object, it is much more convenient to operate data in the database. All redis object objects can be used uniformly. When it is necessary to distinguish the object type, judge based on the type. And officially due to the adoption of this object-oriented approach, the redis code looks a lot like C++ code, but in fact it is all written in C.
//#define REDIS_STRING 0 // 字符串类型 //#define REDIS_LIST 1 // 链表类型 //#define REDIS_SET 2 // 集合类型(无序的),可以求差集,并集等 //#define REDIS_ZSET 3 // 有序的集合类型 //#define REDIS_HASH 4 // 哈希类型 //#define REDIS_ENCODING_RAW 0 /* Raw representation */ //raw 未加工 //#define REDIS_ENCODING_INT 1 /* Encoded as integer */ //#define REDIS_ENCODING_HT 2 /* Encoded as hash table */ //#define REDIS_ENCODING_ZIPMAP 3 /* Encoded as zipmap */ //#define REDIS_ENCODING_LINKEDLIST 4 /* Encoded as regular linked list */ //#define REDIS_ENCODING_ZIPLIST 5 /* Encoded as ziplist */ //#define REDIS_ENCODING_INTSET 6 /* Encoded as intset */ //#define REDIS_ENCODING_SKIPLIST 7 /* Encoded as skiplist */ //#define REDIS_ENCODING_EMBSTR 8 /* Embedded sds string encoding */ typedef struct redisObject { unsigned type:4; // 对象的类型,包括 /* Object types */ unsigned encoding:4; // 底部为了节省空间,一种type的数据, // 可 以采用不同的存储方式 unsigned lru:REDIS_LRU_BITS; /* lru time (relative to server.lruclock) */ int refcount; // 引用计数 void *ptr; } robj;
After all, redis is still a key-value database. No matter how many data structures it supports, the final storage is still in key-value mode, but the value can be a linked list, set, sorted set, hash table, etc. Like memcached, all keys are strings, and strings are also used for specific storage such as set, sorted set, hash table, etc. There is no ready-made string in C, so the first task of redis is to implement a string, named sds (simple dynamic string). The following code is a very simple structure. len stores the total memory length of the string, and free means it is still available. How many bytes are unused, and buf stores specific data. Obviously len-free is the current length of the string.
struct sdshdr { int len; int free; char buf[]; };
字符串解决了,所有的key都存成sds就行了,那么key和value怎么关联呢?key-value的格式在脚本语言中很好处理,直接使用字典即可,C没有字典,怎么办呢?自己写一个呗(redis十分热衷于造轮子)。看下面的代码,privdata存额外信息,用的很少,至少我们发现。 dictht是具体的哈希表,一个dict对应两张哈希表,这是为了扩容(包括rehashidx也是为了扩容)。dictType存储了哈希表的属性。redis还为dict实现了迭代器(所以说看起来像c++代码)。
哈希表的具体实现是和mc类似的做法,也是使用开链法来解决冲突,不过里面用到了一些小技巧。比如使用dictType存储函数指针,可以动态配置桶里面元素的操作方法。又比如dictht中保存的sizemask取size(桶的数量)-1,用它与key做&操作来代替取余运算,加快速度等等。总的来看,dict里面有两个哈希表,每个哈希表的桶里面存储dictEntry链表,dictEntry存储具体的key和value。
前面说过,一个dict对于两个dictht,是为了扩容(其实还有缩容)。正常的时候,dict只使用dictht[0],当dict[0]中已有entry的数量与桶的数量达到一定的比例后,就会触发扩容和缩容操作,我们统称为rehash,这时,为dictht[1]申请rehash后的大小的内存,然后把dictht[0]里的数据往dictht[1]里面移动,并用rehashidx记录当前已经移动万的桶的数量,当所有桶都移完后,rehash完成,这时将dictht[1]变成dictht[0], 将原来的dictht[0]变成dictht[1],并变为null即可。不同于memcached,这里不用开一个后台线程来做,而是就在event loop中完成,并且rehash不是一次性完成,而是分成多次,每次用户操作dict之前,redis移动一个桶的数据,直到rehash完成。这样就把移动分成多个小移动完成,把rehash的时间开销均分到用户每个操作上,这样避免了用户一个请求导致rehash的时候,需要等待很长时间,直到rehash完成才有返回的情况。不过在rehash期间,每个操作都变慢了点,而且用户还不知道redis在他的请求中间添加了移动数据的操作,感觉redis太贱了 :-D
typedef struct dict { dictType *type; // 哈希表的相关属性 void *privdata; // 额外信息 dictht ht[2]; // 两张哈希表,分主和副,用于扩容 int rehashidx; /* rehashing not in progress if rehashidx == -1 */ // 记录当前数据迁移的位置,在扩容的时候用的 int iterators; /* number of iterators currently running */ // 目前存在的迭代器的数量 } dict; typedef struct dictht { dictEntry **table; // dictEntry是item,多个item组成hash桶里面的链表,table则是多个链表头指针组成的数组的指针 unsigned long size; // 这个就是桶的数量 // sizemask取size - 1, 然后一个数据来的时候,通过计算出的hashkey, 让hashkey & sizemask来确定它要放的桶的位置 // 当size取2^n的时候,sizemask就是1...111,这样就和hashkey % size有一样的效果,但是使用&会快很多。这就是原因 unsigned long sizemask; unsigned long used; // 已经数值的dictEntry数量 } dictht; typedef struct dictType { unsigned int (*hashFunction)(const void *key); // hash的方法 void *(*keyDup)(void *privdata, const void *key); // key的复制方法 void *(*valDup)(void *privdata, const void *obj); // value的复制方法 int (*keyCompare)(void *privdata, const void *key1, const void *key2); // key之间的比较 void (*keyDestructor)(void *privdata, void *key); // key的析构 void (*valDestructor)(void *privdata, void *obj); // value的析构 } dictType; typedef struct dictEntry { void *key; union { void *val; uint64_t u64; int64_t s64; } v; struct dictEntry *next; } dictEntry;
有了dict,数据库就好实现了。所有数据读存储在dict中,key存储成dictEntry中的key(string),用void* 指向一个redis object,它可以是5种类型中的任何一种。如下图,结构构造是这样,不过这个图已经过时了,有一些与redis3.0不符合的地方。
Each of the objects of type 5 has at least two underlying implementations. There are three types of string: REDIS_ENCODING_RAW, REDIS_ENCIDING_INT, REDIS_ENCODING_EMBSTR. Lists include: ordinary two-way linked list and compressed linked list. Compressed linked list simply means transforming the array into a linked list, continuous space, and then simulating the linked list by storing the size information of the string. Compared with ordinary linked lists, it can save space, but it has side effects. Because it is a continuous space, when changing the memory size, it needs to be reallocated, and because the byte size of the string is saved, it may cause continuous updates (specific implementation Please look at the code in detail). Set has dict and intset (use it to store when all integers are integers), sorted set has: skiplist and ziplist, and hashtable implementation has compressed list, dict and ziplist. Skiplist is a skip list. It has an efficiency close to that of a red-black tree, but is much simpler to implement than a red-black tree, so it is adopted (strange, there is no wheel reinvented here. Is it because this wheel is a bit difficult?). The hash table can be implemented using dict. In the dict, the key in each dictentry stores the key (which is the key of the key-value pair in the hash table), and the value stores the value. They are both strings. As for the dict in the set, the key in each dictentry stores the value of a specific element in the set, and the value is null. The zset (ordered set) in the picture is wrong. zset is implemented using skiplist and ziplist. First of all, skiplist is easy to understand, so just think of it as a substitute for red-black trees. Like red-black trees, it can also be sorted. How to use ziplist to store zset? First, in zset, each element in the set has a score, which is used to sort. Therefore, in ziplist, according to the score, the element is stored first, then its score, then the next element, and then the score. This is a continuous storage, so when inserting or deleting, memory needs to be reallocated. So when the number of elements exceeds a certain number, or the number of characters of an element exceeds a certain number, redis will choose to use skiplist to implement zset (if ziplist is currently used, the data in the ziplist will be taken out and stored in a new skiplist , and then delete and change the ziplist. This is the underlying conversion, and other types of redis objects can also be converted). In addition, how does ziplist implement hashtable? In fact, it is very simple, just store a key, store a value, then store a key, and then store a value. It is still stored sequentially, similar to the zset implementation, so when the elements exceed a certain number, or the number of characters of an element exceeds a certain number, it will be converted into a hashtable for implementation. Various underlying implementation methods can be converted, and redis can choose the most appropriate implementation method according to the situation. This is also the benefit of using a similar object-oriented implementation method.
It should be pointed out that when using skiplist to implement zset, a dict is actually used, which stores the same key-value pairs. why? Because skiplist search is only lgn (may become n), and dict can be O(1), so use a dict to speed up the search. Since skiplist and dict can point to the same redis object, too much memory will not be wasted. In addition, when using ziplist to implement zset, why not use dict to speed up the search? Because ziplist supports a small number of elements (when the number is large, it is converted into a skiplist), and sequential traversal is also fast, so dict is not needed.
From this point of view, the above dict, dictType, dictHt, dictEntry, and redis object are all very thoughtful. Together, they implement a flexible and efficient database with object-oriented color. I have to say that the design of the redis database is still very powerful.
Different from memcached, redis has more than one database, and there are 16 by default, numbered 0-15. Customers can choose which database to use, and database No. 0 is used by default. Different database data are not shared, that is, the same key can exist in different databases, but in the same database, the key must be unique.
Redis also supports the setting of expire time. Let's look at the redis object above. There is no field to save expire. So how does redis record the expire time of data? Redis adds another dict to each database. This dict is called expire dict. The key in the dict entry is a pair of keys, and the value is a redis object whose data is a 64-bit int. This int is the expire time. . In this way, when determining whether a key has expired, you can find it in the expire dict, take out the expire time and compare it with the current time. Why do this? Because not all keys will have an expiration time set, for keys that do not set an expire time, saving an expire time will waste space. Instead, use an expire dict to save it separately, and you can use the memory flexibly as needed (detected When the key expires, it will be deleted from the expire dict).
redis的expire 机制是怎样的呢? 与memcahed类似,redis也是惰性删除,即要用到数据时,先检查key是否过期,过期则删除,然后返回错误。单纯的靠惰性删除,上面说过可能会导致内存浪费,所以redis也有补充方案,redis里面有个定时执行的函数,叫servercron,它是维护服务器的函数,在它里面,会对过期数据进行删除,注意不是全删,而是在一定的时间内,对每个数据库的expire dict里面的数据随机选取出来,如果过期,则删除,否则再选,直到规定的时间到。即随机选取过期的数据删除,这个操作的时间分两种,一种较长,一种较短,一般执行短时间的删除,每隔一定的时间,执行一次长时间的删除。这样可以有效的缓解光采用惰性删除而导致的内存浪费问题。
以上就是redis的数据的实现,与memcached不同,redis还支持数据持久化,这个下面介绍。
redis和memcached的最大不同,就是redis支持数据持久化,这也是很多人选择使用redis而不是memcached的最大原因。 redis的持久化,分为两种策略,用户可以配置使用不同的策略。
4.1 RDB持久化
用户执行save或者bgsave的时候,就会触发RDB持久化操作。RDB持久化操作的核心思想就是把数据库原封不动的保存在文件里。
那如何存储呢?如下图, 首先存储一个REDIS字符串,起到验证的作用,表示是RDB文件,然后保存redis的版本信息,然后是具体的数据库,然后存储结束符EOF,最后用检验和。关键就是databases,看它的名字也知道,它存储了多个数据库,数据库按照编号顺序存储,0号数据库存储完了,才轮到1,然后是2, 一直到最后一个数据库。
每一个数据库存储方式如下,首先一个1字节的常量SELECTDB,表示切换db了,然后下一个接上数据库的编号,它的长度是可变的,然后接下来就是具体的key-value对的数据了。
int rdbSaveKeyValuePair(rio *rdb, robj *key, robj *val, long long expiretime, long long now) { /* Save the expire time */ if (expiretime != -1) { /* If this key is already expired skip it */ if (expiretime < now) return 0; if (rdbSaveType(rdb,REDIS_RDB_OPCODE_EXPIRETIME_MS) == -1) return -1; if (rdbSaveMillisecondTime(rdb,expiretime) == -1) return -1; } /* Save type, key, value */ if (rdbSaveObjectType(rdb,val) == -1) return -1; if (rdbSaveStringObject(rdb,key) == -1) return -1; if (rdbSaveObject(rdb,val) == -1) return -1; return 1; }
由上面的代码也可以看出,存储的时候,先检查expire time,如果已经过期,不存就行了,否则,则将expire time存下来,注意,及时是存储expire time,也是先存储它的类型为REDIS_RDB_OPCODE_EXPIRETIME_MS,然后再存储具体过期时间。接下来存储真正的key-value对,首先存储value的类型,然后存储key(它按照字符串存储),然后存储value,如下图。
在rdbsaveobject中,会根据val的不同类型,按照不同的方式存储,不过从根本上来看,最终都是转换成字符串存储,比如val是一个linklist,那么先存储整个list的字节数,然后遍历这个list,把数据取出来,依次按照string写入文件。对于hash table,也是先计算字节数,然后依次取出hash table中的dictEntry,按照string的方式存储它的key和value,然后存储下一个dictEntry。 总之,RDB的存储方式,对一个key-value对,会先存储expire time(如果有的话),然后是value的类型,然后存储key(字符串方式),然后根据value的类型和底层实现方式,将value转换成字符串存储。这里面为了实现数据压缩,以及能够根据文件恢复数据,redis使用了很多编码的技巧,有些我也没太看懂,不过关键还是要理解思想,不要在意这些细节。
保存了RDB文件,当redis再启动的时候,就根据RDB文件来恢复数据库。由于以及在RDB文件中保存了数据库的号码,以及它包含的key-value对,以及每个key-value对中value的具体类型,实现方式,和数据,redis只要顺序读取文件,然后恢复object即可。由于保存了expire time,发现当前的时间已经比expire time大了,即数据已经超时了,则不恢复这个key-value对即可。
Saving RDB files is a huge project, so redis also provides a background saving mechanism. That is, when bgsave is executed, redis forks a child process to let the child process perform the saved work, while the parent process continues to provide redis's normal database services. Since the child process copies the address space of the parent process, that is, the child process owns the database when the parent process forks, the child process performs a save operation and writes the database it inherited from the parent process into a temp file. During the copying of the child process, redis will record the number of modifications (dirty) of the database. When the child process is completed, the SIGUSR1 signal is sent to the parent process. When the parent process catches this signal, it knows that the child process has completed the copy. Then the parent process renames the temp file saved by the child process to the real rdb file (that is, the real saving is successful). Only then changed to the target file, this is the safe way). Then record the end time of this save.
There is a problem here. During the saving period of the child process, the database of the parent process has been modified, and the parent process only records the number of modifications (dirty) without any correction operations. It seems that what RDB saves is not a real-time database, which is a bit unpretentious. However, AOF persistence, which will be introduced later, solves this problem.
In addition to the customer executing the sava or bgsave command, the RDB saving conditions can also be configured. That is, it is configured in the configuration file. If the database is modified dirty times within t time, it will be saved in the background. When redis is serving cron, it will judge whether it meets the conditions based on the number of dirty items and the time of the last save. If the conditions are met, bg save will be performed. Note that there can only be one child process for background saving at any time, because saving It is a very expensive IO operation. A large number of IO operations in multiple processes are inefficient and difficult to manage.
4.2 AOF persistence
First of all, think about a question. Does saving a database require saving all the data in the database like RDB? Is there any other way?
RDB saves only the final database, which is a result. How did the result come about? It is established through various commands of the user, so the result does not need to be saved, but only the command that created the result is saved. The AOF of redis has this idea. Unlike RDB, it saves the db data. It saves the commands to create the database one by one.
Let's first look at the format of the AOF file. It stores commands one by one. First, the command length is stored, and then the command is stored. You can study the specific delimiters in depth by yourself. This is not the point. Anyway, you know how the AOF file is stored. It is the command executed by the redis client.
There is an sds aof_buf in the redis server. If aof persistence is turned on, each command to modify the database will be stored in this aof_buf (the string in the command format in the aof file is saved), and then the event loop does not loop once. In the server cron Call flushaofbuf, write the command in aof_buf to the aof file (actually write, what is actually written is the kernel buffer), then clear aof_buf and enter the next loop. In this way, all database changes can be restored through the commands in the aof file, achieving the effect of saving the database.
It should be noted that the write called in flushaofbuf only writes the data into the kernel buffer. The actual writing of the file is decided by the kernel itself and may need to be delayed for a period of time. However, redis supports configuration. You can configure sync after each write, and then call sync in redis to write the data in the kernel to the file. This only requires a system call and is time-consuming. You can also configure the policy to sync once every second, then redis will start a background thread (so redis is not a single thread, just a single eventloop), and this background thread will call sync every second. Here I have to ask, why was sync not considered when using RDB? Because RDB is stored once, unlike AOF multiple times, calling sync once during RDB has no effect, and when using bg save, the child process will exit (exit) by itself, and the buffer will be flushed in the exit function at this time. area, it is automatically written to the file.
Looking at it again, if you don't want to use aof_buf to save each modification command, you can also use aof persistence. Redis provides aof_rewrite, which generates commands based on the existing database and then writes the commands into the aof file. Very strange, right? Yes, it's that awesome. When performing aof_rewrite, the redis variable is used in each database, and then different commands are generated based on the specific type of value in the key-value pair, such as list, then it generates a command to save the list, which contains the information needed to save the list. The required data, if the list data is too long, will be divided into multiple commands. First create the list, and then add elements to the list. In short, the command to save the data is generated in reverse based on the data. Then store these commands in the aof file. Doesn't this achieve the same effect as aof append?
再来看,aof格式也支持后台模式。执行aof_bgrewrite的时候,也是fork一个子进程,然后让子进程进行aof_rewrite,把它复制的数据库写入一个临时文件,然后写完后用新号通知父进程。父进程判断子进程的退出信息是否正确,然后将临时文件更名成最终的aof文件。好了,问题来了。在子进程持久化期间,可能父进程的数据库有更新,怎么把这个更新通知子进程呢?难道要用进程间通信么?是不是有点麻烦呢?你猜redis怎么做的?它根本不通知子进程。什么,不通知?那更新怎么办? 在子进程执行aof_bgrewrite期间,父进程会保存所有对数据库有更改的操作的命令(增,删除,改等),把他们保存在aof_rewrite_buf_blocks中,这是一个链表,每个block都可以保存命令,存不下时,新申请block,然后放入链表后面即可,当子进程通知完成保存后,父进程将aof_rewrite_buf_blocks的命令append 进aof文件就可以了。多么优美的设计,想一想自己当初还考虑用进程间通信,别人直接用最简单的方法就完美的解决了问题,有句话说得真对,越优秀的设计越趋于简单,而复杂的东西往往都是靠不住的。
至于aof文件的载入,也就是一条一条的执行aof文件里面的命令而已。不过考虑到这些命令就是客户端发送给redis的命令,所以redis干脆生成了一个假的客户端,它没有和redis建立网络连接,而是直接执行命令即可。首先搞清楚,这里的假的客户端,并不是真正的客户端,而是存储在redis里面的客户端的信息,里面有写和读的缓冲区,它是存在于redis服务器中的。所以,如下图,直接读入aof的命令,放入客户端的读缓冲区中,然后执行这个客户端的命令即可。这样就完成了aof文件的载入。
// 创建伪客户端 fakeClient = createFakeClient(); while(命令不为空) { // 获取一条命令的参数信息 argc, argv ... // 执行 fakeClient->argc = argc; fakeClient->argv = argv; cmd->proc(fakeClient); }
整个aof持久化的设计,个人认为相当精彩。其中有很多地方,值得膜拜。
redis另一个比memcached强大的地方,是它支持简单的事务。事务简单说就是把几个命令合并,一次性执行全部命令。对于关系型数据库来说,事务还有回滚机制,即事务命令要么全部执行成功,只要有一条失败就回滚,回到事务执行前的状态。redis不支持回滚,它的事务只保证命令依次被执行,即使中间一条命令出错也会继续往下执行,所以说它只支持简单的事务。
首先看redis事务的执行过程。首先执行multi命令,表示开始事务,然后输入需要执行的命令,最后输入exec执行事务。 redis服务器收到multi命令后,会将对应的client的状态设置为REDIS_MULTI,表示client处于事务阶段,并在client的multiState结构体里面保持事务的命令具体信息(当然首先也会检查命令是否能否识别,错误的命令不会保存),即命令的个数和具体的各个命令,当收到exec命令后,redis会顺序执行multiState里面保存的命令,然后保存每个命令的返回值,当有命令发生错误的时候,redis不会停止事务,而是保存错误信息,然后继续往下执行,当所有的命令都执行完后,将所有命令的返回值一起返回给客户。redis为什么不支持回滚呢?网上看到的解释出现问题是由于客户程序的问题,所以没必要服务器回滚,同时,不支持回滚,redis服务器的运行高效很多。在我看来,redis的事务不是传统关系型数据库的事务,要求CIAD那么非常严格,或者说redis的事务都不是事务,只是提供了一种方式,使得客户端可以一次性执行多条命令而已,就把事务当做普通命令就行了,支持回滚也就没必要了。
我们知道redis是单event loop的,在真正执行一个事物的时候(即redis收到exec命令后),事物的执行过程是不会被打断的,所有命令都会在一个event loop中执行完。但是在用户逐个输入事务的命令的时候,这期间,可能已经有别的客户修改了事务里面用到的数据,这就可能产生问题。所以redis还提供了watch命令,用户可以在输入multi之前,执行watch命令,指定需要观察的数据,这样如果在exec之前,有其他的客户端修改了这些被watch的数据,则exec的时候,执行到处理被修改的数据的命令的时候,会执行失败,提示数据已经dirty。 这是如何是实现的呢? 原来在每一个redisDb中还有一个dict watched_keys,watched_kesy中dictentry的key是被watch的数据库的key,而value则是一个list,里面存储的是watch它的client。同时,每个client也有一个watched_keys,里面保存的是这个client当前watch的key。在执行watch的时候,redis在对应的数据库的watched_keys中找到这个key(如果没有,则新建一个dictentry),然后在它的客户列表中加入这个client,同时,往这个client的watched_keys中加入这个key。当有客户执行一个命令修改数据的时候,redis首先在watched_keys中找这个key,如果发现有它,证明有client在watch它,则遍历所有watch它的client,将这些client设置为REDIS_DIRTY_CAS,表面有watch的key被dirty了。当客户执行的事务的时候,首先会检查是否被设置了REDIS_DIRTY_CAS,如果是,则表明数据dirty了,事务无法执行,会立即返回错误,只有client没有被设置REDIS_DIRTY_CAS的时候才能够执行事务。 需要指出的是,执行exec后,该client的所有watch的key都会被清除,同时db中该key的client列表也会清除该client,即执行exec后,该client不再watch任何key(即使exec没有执行成功也是一样)。所以说redis的事务是简单的事务,算不上真正的事务。
以上就是redis的事务,感觉实现很简单,实际用处也不是太大。
redis支持频道,即加入一个频道的用户相当于加入了一个群,客户往频道里面发的信息,频道里的所有client都能收到。
实现也很简单,也watch_keys实现差不多,redis server中保存了一个pubsub_channels的dict,里面的key是频道的名称(显然要唯一了),value则是一个链表,保存加入了该频道的client。同时,每个client都有一个pubsub_channels,保存了自己关注的频道。当用用户往频道发消息的时候,首先在server中的pubsub_channels找到改频道,然后遍历client,给他们发消息。而订阅,取消订阅频道不够都是操作pubsub_channels而已,很好理解。
同时,redis还支持模式频道。即通过正则匹配频道,如有模式频道p, 1, 则向普通频道p1发送消息时,会匹配p,1,除了往普通频道发消息外,还会往p,1模式频道中的client发消息。注意,这里是用发布命令里面的普通频道来匹配已有的模式频道,而不是在发布命令里制定模式频道,然后匹配redis里面保存的频道。实现方式也很简单,在redis server里面有个pubsub_patterns的list(这里为什么不用dict?因为pubsub_patterns的个数一般较少,不需要使用dict,简单的list就好了),它里面存储的是pubsubPattern结构体,里面是模式和client信息,如下所示,一个模式,一个client,所以如果有多个clint监听一个pubsub_patterns的话,在list面会有多个pubsubPattern,保存client和pubsub_patterns的对应关系。 同时,在client里面,也有一个pubsub_patterns list,不过里面存储的就是它监听的pubsub_patterns的列表(就是sds),而不是pubsubPattern结构体。
typedef struct pubsubPattern { redisClient *client; // 监听的client robj *pattern; // 模式 } pubsubPattern;
当用户往一个频道发送消息的时候,首先会在redis server中的pubsub_channels里面查找该频道,然后往它的客户列表发送消息。然后在redis server里面的pubsub_patterns里面查找匹配的模式,然后往client里面发送消息。 这里并没有去除重复的客户,在pubsub_channels可能已经给某一个client发过message了,然后在pubsub_patterns中可能还会给用户再发一次(甚至更多次)。 估计redis认为这是客户程序自己的问题,所以不处理。
/* Publish a message */ int pubsubPublishMessage(robj *channel, robj *message) { int receivers = 0; dictEntry *de; listNode *ln; listIter li; /* Send to clients listening for that channel */ de = dictFind(server.pubsub_channels,channel); if (de) { list *list = dictGetVal(de); listNode *ln; listIter li; listRewind(list,&li); while ((ln = listNext(&li)) != NULL) { redisClient *c = ln->value; addReply(c,shared.mbulkhdr[3]); addReply(c,shared.messagebulk); addReplyBulk(c,channel); addReplyBulk(c,message); receivers++; } } /* Send to clients listening to matching channels */ if (listLength(server.pubsub_patterns)) { listRewind(server.pubsub_patterns,&li); channel = getDecodedObject(channel); while ((ln = listNext(&li)) != NULL) { pubsubPattern *pat = ln->value; if (stringmatchlen((char*)pat->pattern->ptr, sdslen(pat->pattern->ptr), (char*)channel->ptr, sdslen(channel->ptr),0)) { addReply(pat->client,shared.mbulkhdr[4]); addReply(pat->client,shared.pmessagebulk); addReplyBulk(pat->client,pat->pattern); addReplyBulk(pat->client,channel); addReplyBulk(pat->client,message); receivers++; } } decrRefCount(channel); } return receivers; }
总的来看,redis比memcached的功能多很多,实现也更复杂。 不过memcached更专注于保存key-value数据(这已经能满足大多数使用场景了),而redis提供更丰富的数据结构及其他的一些功能。不能说redis比memcached好,不过从源码阅读的角度来看,redis的价值或许更大一点。 另外,redis3.0里面支持了集群功能,这部分的代码还没有研究,后续再跟进。
The above is the detailed content of Comparison of memcached and redis implementations. For more information, please follow other related articles on the PHP Chinese website!