Ceph is open source distributed storage software that can pool the local storage resources of ordinary x86 servers into one or more storage pools and, on top of those pools, provide users with unified block, object, and file storage services. It meets enterprise demands for highly reliable, high-performance, and highly scalable storage, and it is increasingly favored by enterprises. Extensive production experience has shown that Ceph has an advanced design, comprehensive functionality, and great flexibility. For enterprises, however, these strengths are a double-edged sword: handled well, Ceph serves the business well; handled by people who are not yet skilled and who do not understand Ceph's temperament, it can occasionally cause a lot of trouble. The case I want to share below is exactly such a story.
Company A deployed a Ceph object storage cluster to provide public cloud storage services, along with an SDK to help customers quickly move unstructured data such as images, videos, and APK installation packages into the cloud. Before the business officially went live, thorough functional testing, exception testing, and performance testing were carried out on Ceph.
The cluster is not particularly large. It runs community version 0.80 (Firefly) on 30 servers in total, each configured with 32GB of memory, ten 4TB SATA disks, and one 160GB Intel S3700 SSD. The 300 SATA disks form the data pool (by default a pool named .rgw.buckets) that stores object data; the 30 SSDs form the metadata pool (by default a pool named .rgw.buckets.index) that stores object metadata. Anyone with experience deploying and operating Ceph object storage knows that this layout is standard practice: Ceph object storage is multi-tenant, and when multiple users PUT objects into the same bucket (a logical namespace belonging to a user), the objects' metadata is written into that bucket's index object. Since the index object is shared, it must be locked on each access, so placing the bucket index objects in a pool built from high-performance SSDs shortens each index access, improves IO performance, and increases the overall concurrency of the object store.
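A minimal sketch of how such a split can be expressed, assuming the CRUSH map already contains separate roots for the two disk types (the names sata-root, ssd-root, sata-rule, and ssd-rule below are illustrative, not from the actual cluster):

```bash
# Create placement rules that confine a pool to one class of disks.
# "sata-root" and "ssd-root" are assumed CRUSH roots for the two disk types.
ceph osd crush rule create-simple sata-rule sata-root host
ceph osd crush rule create-simple ssd-rule  ssd-root  host

# Note the ruleset ids assigned to the new rules.
ceph osd crush rule dump

# Bind the data pool to the SATA rule and the bucket index pool to the SSD
# rule (0.80-era syntax uses crush_ruleset; newer releases call it crush_rule).
ceph osd pool set .rgw.buckets       crush_ruleset <sata-ruleset-id>
ceph osd pool set .rgw.buckets.index crush_ruleset <ssd-ruleset-id>
```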
After the system went live, customer data flowed steadily into the Ceph object storage cluster. For the first three months everything ran normally. SATA disk failures did occur during this period, but Ceph's own fault detection and recovery mechanisms handled them easily, and the operations team felt quite relaxed. From May onward, however, the operations engineers occasionally complained that the OSDs backed by the SSDs sometimes became very slow and caused business-side stalls. Their simple and effective remedy was to restart the affected OSD, after which things returned to normal. This happened sporadically a few times, and the operations engineers asked whether there was anything wrong with the way we were using the SSDs. After some analysis we concluded that there was nothing unusual about the SSD usage; apart from switching the disk scheduler to deadline, which had already been done, nothing stood out, so we did not pay much attention to the matter.
At 21:30 on the evening of May 28, the operations engineers received an alarm on their phones: a small number of file writes were failing. They logged in immediately and found the cause to be slow reads and writes on the OSD backed by the SSD of one server. Based on past experience, restarting the OSD process in such cases restores it to normal, so they restarted it without hesitation and waited for the system to recover. This time, however, the SSD OSD started extremely slowly, which in turn caused the SATA OSD processes on the same server to stall and lose their heartbeats. After a while, SSD OSD processes on other servers were also found to be slowing down. They went on restarting the SSD OSD processes on those servers, with similar results, and after repeated restarts more and more SSD OSD processes could no longer be started at all. The operations team immediately reported the situation to the technical R&D department and requested urgent support.
After arriving at the office and hearing the operations team's account, we logged on to the servers, tried to start several of the SSD OSD processes, and repeatedly observed and compared their startup behavior:
1. top showed that an OSD process started allocating memory wildly as soon as it launched, up to 20GB and sometimes even 30GB; sometimes system memory was exhausted and the swap partition was used; and even when the process was eventually pulled up successfully, the OSD still held on to as much as 10GB of memory.
2. The OSD log showed that output stopped after the process entered the FileJournal::_open stage, and only after a long time (more than 30 minutes) did it move on to the load_pg stage; after entering load_pg there was another long wait, and even though load_pg eventually completed, the process still committed suicide and exited.
3. During these long startups, pstack was used to inspect the process call stacks. The stack seen in the FileJournal::_open stage was OSD journal replay, with LevelDB records being deleted as part of transaction processing; the stack seen in the load_pg stage showed the LevelDB log being used to repair LevelDB files.
4. Sometimes an SSD OSD process would start successfully, but after running for a while another SSD OSD process would die abnormally.
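For reference, the commands behind these observations look roughly like the sketch below; the OSD id 12 and the log path are placeholders, not values from the actual incident:

```bash
# Watch memory usage of a specific starting OSD process (osd.12 is an example id).
OSD_PID=$(pgrep -f 'ceph-osd -i 12')
top -p "$OSD_PID"

# Follow the OSD log to see which startup stage it is stuck in
# (FileJournal::_open, load_pg, ...).
tail -f /var/log/ceph/ceph-osd.12.log

# Dump the call stacks of all threads in the starting OSD process.
pstack "$OSD_PID"
```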
All of these symptoms pointed to LevelDB. Was the huge memory allocation related to it? Reading the LevelDB-related code further, we found that when a LevelDB iterator is used inside a transaction, memory keeps being allocated as the iterator walks over records, and none of it is released until the iterator is destroyed. Seen that way, if an iterator has to visit a very large number of records, a large amount of memory will be allocated during the iteration. With this in mind we checked the number of objects per bucket and found several buckets holding 20 million, 30 million, even 50 million objects, and the index objects of these huge buckets happened to live exactly on the SSD OSDs that were failing. The cause of the massive memory consumption appeared to be found, which was a major breakthrough. By then it was already 21:00 on the 30th. Over the previous two days users had started calling to complain, and everyone felt that this was "big trouble". After fighting for nearly 48 hours with red, swollen eyes, the team had to stop and rest, otherwise some of them would collapse before dawn.
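As a rough illustration of that check (the bucket name "bigbucket" is an example, and the .dir.<bucket_id> name follows RGW's usual convention for bucket index objects):

```bash
# Show per-bucket statistics, including the object count, for a suspect bucket.
radosgw-admin bucket stats --bucket=bigbucket

# Find the bucket's internal id, then ask Ceph which OSDs hold its index
# object; RGW names bucket index objects ".dir.<bucket_id>".
radosgw-admin metadata get bucket:bigbucket
ceph osd map .rgw.buckets.index .dir.<bucket_id>
```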
At 8:30 on the 31st, the team went back into battle.
The next problem was that some OSDs went through the long startup process and still committed suicide after load_pg completed. By reading the Ceph code, we confirmed that certain threads were killing themselves on a timeout because they had not been scheduled for a long time (most likely because the LevelDB work was monopolizing the CPU). Ceph has a filestore_op_thread_suicide_timeout configuration parameter, and testing confirmed that setting it to a large value avoids this kind of suicide. We saw a little more light; the clock pointed to 12:30.
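A minimal sketch of how such an override can be checked and applied, using osd.12 as an example id (a persistent value would normally go into the [osd] section of ceph.conf instead):

```bash
# Check the value an OSD is currently running with, via its admin socket
# (run on the OSD's own host).
ceph daemon osd.12 config show | grep filestore_op_thread_suicide_timeout

# Start a single OSD manually with a much larger timeout for this one run;
# any Ceph config option can be passed on the daemon's command line.
ceph-osd -i 12 --filestore_op_thread_suicide_timeout 72000
```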
Some processes, once started, still held on to as much as 10GB of memory. If that problem was not solved, then even after the SSD OSDs were pulled up, the other SATA OSDs on the same server would suffer from the memory shortage. Keep going, everyone, this is the darkness before dawn. Some of us searched the documentation, others read code, and at 14:30 we finally found, in the official Ceph documentation, a command that forces memory to be released: ceph tell osd.* heap release. It can be run after a process starts to release the excess memory held by the OSD process. Everyone was excited and immediately tested it; it did indeed work.
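These are the standard tcmalloc heap-profiler commands exposed through ceph tell, for example:

```bash
# Inspect how much heap an OSD's allocator is holding versus actually using.
ceph tell osd.12 heap stats

# Ask one OSD, or all OSDs, to return unused heap pages to the OS
# (quote the * so the shell does not try to expand it).
ceph tell osd.12 heap release
ceph tell 'osd.*' heap release
```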
There was also the phenomenon that after one SSD OSD had been started and had run for a while, SSD OSD processes on other servers would exit. Based on the analysis above, this was mainly caused by data migration: an OSD from which data has been migrated away deletes the related records, which triggers LevelDB deletions of object metadata records. Once an oversized bucket index object is involved, LevelDB walks that object's metadata records with an iterator, which consumes a huge amount of memory and makes the OSD processes on that server abnormal.
Based on the above analysis, and after nearly two hours of repeated discussion and argument, we drew up the following emergency measures:
1. Set the noout flag on the cluster so that no PG migration takes place. Once PG migration occurs, an OSD that has had a PG moved away deletes the object data of that PG after the move, which triggers LevelDB deletion of the object metadata records; if the PG contains an oversized bucket index object, the iterator will traverse those metadata records and consume a huge amount of memory.
2. To rescue the SSD OSDs and restore the system as quickly as possible, start each SSD OSD process with the filestore_op_thread_suicide_timeout parameter set to a large value. When a faulty OSD is pulled up, the LevelDB work hogs the CPU and blocks thread scheduling; Ceph's thread deadlock detection then treats any thread that has not been scheduled within the configured time as deadlocked and kills the process. Setting this parameter avoids suicide triggered by that detection.
3. With memory this tight, an abnormal OSD will dip into the swap partition while starting. To speed up OSD startup, move the swap partition onto the SSD.
4. Set up a scheduled task that periodically runs ceph tell osd.* heap release to force the OSDs to release the memory they hold.
5. When an SSD OSD runs into trouble, handle it with the following steps (a consolidated command sketch follows the list):
a) First stop all OSD processes on that server to free up memory.
b) Then start the OSD process with the filestore_op_thread_suicide_timeout parameter set to a large value, such as 72000.
c) Watch the OSD's startup; as soon as load_pgs has finished, immediately run ceph tell osd.N heap release by hand to force the memory it holds to be released.
d) Watch the cluster state; once all PGs have returned to normal, start the other SATA OSDs on that server.
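Put together, the emergency procedure looks roughly like the sketch below; the OSD ids, the 5-minute cron interval, and the sysvinit commands (appropriate for the 0.80 era, different under systemd) are illustrative assumptions rather than values from the incident:

```bash
# 1. Freeze the cluster layout: no OSD gets marked out, so no PG migration.
ceph osd set noout

# 5a. Stop every OSD on the affected server to free all of its memory.
/etc/init.d/ceph stop osd

# 5b. Bring the SSD-backed OSD (say osd.12) up by hand with a huge
#     suicide timeout so the LevelDB work cannot trigger the deadlock check.
ceph-osd -i 12 --filestore_op_thread_suicide_timeout 72000

# 5c. As soon as the log shows load_pgs has completed, force the OSD to
#     hand its excess heap back to the OS.
ceph tell osd.12 heap release

# 5d. Wait until all PGs are back to normal, then start the SATA OSDs.
watch ceph pg stat
/etc/init.d/ceph start osd.13   # repeat for the remaining SATA OSDs

# 4. Periodically release heap across all OSDs (example: every 5 minutes).
echo '*/5 * * * * root ceph tell osd.\* heap release' > /etc/cron.d/ceph-heap-release
```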
Following these steps, we began restoring the OSD processes one by one from 17:30. During the recovery, the very large bucket index objects took a long time to backfill, and while that was happening all requests to those buckets were blocked, causing business requests to time out. This is another negative consequence of storing a huge number of objects in a single bucket.
At 23:00 on May 31, all OSD processes were finally restored. From the outbreak of the failure to full recovery, we had fought for 72 nerve-racking hours. Everyone looked at each other and smiled and, still running on adrenaline, pressed on to discuss and agree on a plan to solve the problem once and for all:
1. Expand the servers' memory to 64GB.
2. Limit the maximum number of objects that may be stored in any new bucket.
3. After Ceph version 0.94 has been fully tested, upgrade to it to address the problem of oversized single-bucket index objects.
4. Optimize Ceph's use of LevelDB iterators: within a large transaction, iterate in segments, with each iterator recording its current position and being released after traversing a certain number of records, after which a new iterator is created and traversal resumes from the recorded position. In this way the memory held by any one iterator stays bounded.
Past experience, if not forgotten, is a guide for the future. We summarize the following lessons:
1. The system must be fully tested before going online
Although Company A tested Ceph's functionality, performance, and exception handling thoroughly before the system went live, there was no stress test against large volumes of data. Had a single bucket been tested with tens of millions of objects beforehand, this hidden danger might have been discovered in advance.
2. Every anomaly in operations must be taken seriously and in time
In this case, the operations department had already reported the abnormal SSD behavior some time before the problem blew up; unfortunately, we did not take it seriously. Had we analyzed it in depth at that time, we might have found the root cause and put avoidance measures in place in advance.
3. Get to know Ceph's temperament
Every software product has its specification limits, and Ceph is no exception. If we had understood the Ceph architecture and its implementation principles in depth beforehand, understood the negative effects of storing a huge number of objects in a single bucket, and planned ahead, the problems in this case would not have occurred. RGW has quite comprehensive quota support, including user-level and bucket-level quotas, and the maximum number of objects allowed in a single bucket can be configured.
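For example, bucket-scope quotas can be configured per user with radosgw-admin; the uid "appuser" and the one-million-object limit below are illustrative:

```bash
# Cap every bucket owned by this user at one million objects,
# then switch the bucket-scope quota on.
radosgw-admin quota set    --quota-scope=bucket --uid=appuser --max-objects=1000000
radosgw-admin quota enable --quota-scope=bucket --uid=appuser
```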
4. Always track the latest developments in the community
Ceph version 0.94 supports sharding of bucket index objects: a bucket's index can be split across multiple shard objects, which effectively alleviates the problem of oversized single-bucket index objects.
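In 0.94 (Hammer), the shard count for newly created buckets can be set through the rgw_override_bucket_index_max_shards option, roughly as sketched below. The section name [client.radosgw.gateway], the value of 8 shards, and the service name are examples; note that sharding applies only to buckets created after the setting takes effect:

```bash
# Append the option to the RGW section of ceph.conf, then restart the
# gateway so it takes effect (service name varies by distribution).
cat >> /etc/ceph/ceph.conf <<'EOF'
[client.radosgw.gateway]
    rgw_override_bucket_index_max_shards = 8
EOF
/etc/init.d/ceph-radosgw restart
```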