Here we use nginx's cache system as a starting point to discuss the design of a cache server and its related details. I will try to analyze it from the perspective of design and architecture; due to space limitations, I will not go into the code here. Everyone is welcome to join in and discuss the finer details.
After a cache server obtains a file from the backend, it either sends it directly to the client (so-called transparent passthrough) or caches a copy locally, so that subsequent identical requests can be served from that copy, provided the copy can in fact be used. When a later request is served from a locally cached file, we call it a cache hit. When there is no local copy and the cache server must fetch the file from the backend, according to its configuration or by resolving the domain name, we call it a cache miss. We will discuss cache servers in more depth as we analyze nginx's cache system.
nginx's storage system falls into two categories. The first is enabled with proxy_store: files are stored locally according to the file path in the URL. For example, given /file/2013/0001/en/test.html, nginx creates each directory and the file, in order, under the configured storage root. The second is enabled with proxy_cache: files stored this way are not organized by URL path but are managed by a scheme of nginx's own (let's call it the custom scheme), which is what we will focus on. So what are the advantages of each method?
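To make the two modes concrete, here is a minimal configuration sketch; the paths, upstream name, and zone name are illustrative, not taken from any real setup:

# Mode 1: proxy_store mirrors files on disk using the URL path.
location /file/ {
    proxy_pass   http://backend;
    proxy_store  on;
    root         /data/mirror;   # /file/2013/0001/en/test.html is stored verbatim under this root
}

# Mode 2: proxy_cache manages files by key (md5), not by URL path.
# (proxy_cache_path belongs at the http{} level.)
proxy_cache_path /data/cache levels=1:2 keys_zone=demo:10m;

location / {
    proxy_pass   http://backend;
    proxy_cache  demo;
}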
Storing files by URL path is simple for the program to handle, but performance suffers. First, some URLs are very long; if we replicate such a deep directory tree on the local file system, opening and looking up files becomes very slow (recall how the kernel resolves a path name to an inode). A custom scheme, although it still ultimately deals in files and paths, does not let URL length inflate complexity or degrade performance. In a sense this is a user-mode file system; the classic example is COSS in Squid. The scheme nginx uses is comparatively simple, relying mainly on the md5 of the URL for management, which we will analyze later.
Caching always involves fetching content from the backend and then sending it to the client. The obvious approach, and the one everyone would think of, is to receive and send simultaneously; alternatives such as reading the whole response first and then sending it are too inefficient. nginx does indeed receive and send at the same time, using the ngx_event_pipe_t structure as the medium between backend and client. Since this structure is a general-purpose component, a special flag is needed for the logic that involves storage, and the member cacheable takes on that role:
p->cacheable = u->cacheable || u->store;
That is, if cacheable is 1 the response needs to be stored; otherwise it is not. So what do u->cacheable and u->store stand for? They correspond to the two methods mentioned earlier: proxy_cache and proxy_store, respectively.
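For reference, these flags are one-bit fields. The following is paraphrased and abridged from ngx_http_upstream.h and ngx_event_pipe.h (comments mine, surrounding members omitted):

struct ngx_http_upstream_s {
    /* ... other members omitted ... */
    unsigned    store:1;      /* proxy_store applies to this request */
    unsigned    cacheable:1;  /* proxy_cache decided the response may be cached */
    /* ... */
};

struct ngx_event_pipe_s {
    /* ... */
    unsigned    cacheable:1;  /* tells the pipe to write the data to disk as well */
    /* ... */
};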
(A bit of background: when nginx fetches data from the backend, its behavior is controlled by proxy_buffering, which enables buffering of the backend server's response. With buffering on, nginx reads the response from the backend as quickly as it can be delivered and saves it into buffers whose sizes are set with proxy_buffer_size and proxy_buffers; if the response does not fit into memory, part of it is written to disk. With buffering off, the response is passed to the client synchronously, as soon as it arrives.)
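In configuration terms (the values here are illustrative):

proxy_buffering on;      # the default: buffer the upstream response
proxy_buffer_size 4k;    # buffer used for the response header
proxy_buffers 8 4k;      # buffers used for the response body
# If the response does not fit into these buffers, part of it is
# written to a temporary file on disk.

Note that caching requires buffering: with proxy_buffering off, nginx cannot store the response while forwarding it.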
These are all side topics; we have not yet touched the core of nginx's cache function. From the implementation side, the nginx upstream structure has a member named cache, of type ngx_shm_zone_t*. If the cache function is enabled, this member is used to manage the shared memory (why shared memory? think about the worker processes needing a common view of the cache); with the other storage method it is NULL.

One more piece of terminology: in a cache system a file is usually called a store object, that is, a cache object, so a store object must be created before anything can be cached. An important question is when to create it. What do you think? First we need to decide whether a file should be cached at all. Files requested with the GET method generally should be, so in the early stage of request processing, upon seeing a GET we could create the object right away. But often even the response to a GET request must not be cached; create the object prematurely and you not only waste time and space, you also have to destroy it in the end. So what decides whether a GET response is stored? The Cache-Control field of the response header, which tells a proxy or browser whether the file may be cached. When the response carries no Cache-Control field, cache servers generally cache it by default.
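For example, a backend response whose header looks like this would normally not be stored by a shared cache:

HTTP/1.1 200 OK
Content-Type: text/html
Cache-Control: private, no-store

whereas a response with Cache-Control: max-age=3600, or with no Cache-Control header at all (given the default behavior described above), would be cached.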
With this in mind, the cache server we developed creates the cache object only after the response header has been parsed and sufficient evidence of cacheability has been obtained. Unfortunately, nginx does not do it this way.
nginx creates the cache object in the ngx_http_upstream_init_request function. At what stage of HTTP processing does this function sit? Before the connection to the backend is established. Personally, I do not think this is the right place... What do you think?
For the creation process, you can read the function ngx_http_upstream_cache. Here I will analyze our cache by comparing it with nginx. Our request uses a member named store to establish contact with the cache object; nginx does the same with a cache member in its request structure. The difference is that the space for our store member lives in shared memory, while nginx allocates it from r->pool (why do we do it differently?).
Next, nginx generates the key of the cache object according to the configuration, normally by computing an md5. This key serves as the unique identifier of the cache object in the system. Many people worry about md5 collisions; unless your requirements are unusually strict, I think relying on md5 is perfectly acceptable here, and it keeps the processing simple.
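As an illustration only (this is not nginx's code; nginx uses its bundled ngx_md5 routines on the key defined by proxy_cache_key, which defaults to $scheme$proxy_host$request_uri), computing such a key with OpenSSL could look like this:

#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

int main(void)
{
    const char    *key = "/file/2013/0001/en/test.html";
    unsigned char  digest[MD5_DIGEST_LENGTH];
    char           hex[2 * MD5_DIGEST_LENGTH + 1];
    int            i;

    /* one-shot md5 of the cache key */
    MD5((const unsigned char *) key, strlen(key), digest);

    /* render the 16-byte digest as 32 hex characters */
    for (i = 0; i < MD5_DIGEST_LENGTH; i++) {
        sprintf(&hex[2 * i], "%02x", digest[i]);
    }

    printf("%s\n", hex);   /* the unique identifier of the cache object */
    return 0;
}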
The next thing to deal with is: how should the files be laid out on disk?
Take the example we used before, /file/2013/0001/en/test.html; its md5 value is 8ef9229f02c5672c747dc7a324d658d0, and nginx essentially uses this value as the file name. Is that all? What happens if we pick one directory to hold everything and it fills up with such files? Most file systems limit, or degrade badly with, the number of files in a single directory, so such simple and crude handling will not do. What then? nginx lets you spread files over multi-level directories via configuration. In short, the levels directive specifies the number of directory levels (separated by colons) and the number of characters in each level's name. In our example, the configuration levels=1:2 means two directory levels are used: the first-level directory name is one character, and the second-level name is two characters. nginx supports at most 3 levels, i.e., levels=X:X:X, where each X is 1 or 2.
So where do the characters that make up the directory names come from? Suppose our storage directory is /cache and levels=1:2; then the file above is stored like this:
/cache/0/8d/8ef9229f02c5672c747dc7a324d658d0
You can see where the two directory names come from: the first level, 0, is the last character of the md5 value, and the second level, 8d, is the two characters immediately before it.
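Here is a small sketch of the path construction for levels=1:2; it is a simplification of what nginx's file cache does, with the cache root and digest hard-coded from our example:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *root = "/cache";
    const char *md5  = "8ef9229f02c5672c747dc7a324d658d0";
    size_t      len  = strlen(md5);
    char        path[256];

    /* levels=1:2 -> level 1 is the last hex character,
       level 2 is the two characters before it */
    snprintf(path, sizeof(path), "%s/%c/%c%c/%s",
             root,
             md5[len - 1],                 /* "0"  */
             md5[len - 3], md5[len - 2],   /* "8d" */
             md5);

    printf("%s\n", path);  /* /cache/0/8d/8ef9229f02c5672c747dc7a324d658d0 */
    return 0;
}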
After the object is created, its management structure must be hooked into the cache; this is handled by ngx_http_file_cache_exists, which looks the node up in shared memory and inserts it if it is absent.
What if the directory and file already exist when nginx goes to create them? You can walk through the code and see how nginx handles that case.
That wraps up this discussion; everything so far has really been preparatory work. Next time we will discuss how the arriving backend content is processed.
Extended reading:
http://www.pagefault.info/?p=123
http://www.pagefault.info/?p=375