Inserting into a large MySQL table without an auto-increment primary key is very slow
P粉845862826 2023-08-29 20:31:33
<p>I've recently noticed a significant increase in the variance of the time required to complete simple INSERT statements. While these statements take about 11 milliseconds on average, they can sometimes take 10-30 seconds, and I've even seen them take more than 5 minutes to execute.</p> <p>The MySQL version is <code>8.0.24</code>, running on Windows Server 2016. As far as I know, the server's resources have never been overloaded. The server has ample CPU headroom and 32GB of RAM allocated to it.</p> <p>This is the table I'm using:</p> <pre class="brush:php;toolbar:false;">CREATE TABLE `saved_segment` (
  `recording_id` bigint unsigned NOT NULL,
  `index` bigint unsigned NOT NULL,
  `start_filetime` bigint unsigned NOT NULL,
  `end_filetime` bigint unsigned NOT NULL,
  `offset_and_size` bigint unsigned NOT NULL DEFAULT '18446744073709551615',
  `storage_id` tinyint unsigned NOT NULL,
  PRIMARY KEY (`recording_id`,`index`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci</pre> <p>This table has no other indexes or foreign keys, and is not referenced by foreign keys in any other table. The entire table is about 20GB and holds about 281M rows, which I feel is not too big.</p> <p>The table is used almost exclusively in a read-only fashion, with up to 1000 reads per second. All of these reads are simple SELECT queries rather than complex transactions, and they effectively utilize the primary key index. There are currently very few, if any, concurrent writes to this table. This was done on purpose to try to figure out whether it would help with the slow inserts, but it didn't. Before that, there were always up to 10 concurrent inserts in progress. UPDATE or DELETE statements are never executed on this table.</p> <p>The queries I have problems with are all constructed like this.
They are never run inside a transaction. While inserts into the clustered primary key are definitely not append-only, each query almost always inserts between 1 and 20 adjacent rows into the table:</p> <pre class="brush:php;toolbar:false;">INSERT IGNORE INTO saved_segment
  (recording_id, `index`, start_filetime, end_filetime, offset_and_size, storage_id)
VALUES
  (19173, 631609, 133121662986640000, 133121663016640000, 20562291758298876, 10),
  (19173, 631610, 133121663016640000, 133121663046640000, 20574308942546216, 10),
  (19173, 631611, 133121663046640000, 133121663076640000, 20585348350688128, 10),
  (19173, 631612, 133121663076640000, 133121663106640000, 20596854568114720, 10),
  (19173, 631613, 133121663106640000, 133121663136640000, 20609723363860884, 10),
  (19173, 631614, 133121663136640000, 133121663166640000, 20622106425668780, 10),
  (19173, 631615, 133121663166640000, 133121663196640000, 20634653501528448, 10),
  (19173, 631616, 133121663196640000, 133121663226640000, 20646967172721148, 10),
  (19173, 631617, 133121663226640000, 133121663256640000, 20657773176227488, 10),
  (19173, 631618, 133121663256640000, 133121663286640000, 20668825200822108, 10)</pre> <p>This is the output of the EXPLAIN statement for the above query:</p> <table class="s-table"> <thead> <tr> <th>id</th> <th>select_type</th> <th>table</th> <th>partitions</th> <th>type</th> <th>possible_keys</th> <th>key</th> <th>key_len</th> <th>ref</th> <th>rows</th> <th>filtered</th> <th>Extra</th> </tr> </thead> <tbody> <tr> <td>1</td> <td>INSERT</td> <td>saved_segment</td> <td>NULL</td> <td>ALL</td> <td>NULL</td> <td>NULL</td> <td>NULL</td> <td>NULL</td> <td>NULL</td> <td>NULL</td> <td>NULL</td> </tr> </tbody> </table> <p>These issues are relatively new and were not noticeable when the table was about half its current size.</p> <p>I tried reducing the number of concurrent inserts into the table from about 10 to 1.
I also removed a foreign key (on <code>recording_id</code>) to speed up the inserts even more. <code>ANALYZE TABLE</code> and schema analysis did not yield any actionable information.</p> <p>One solution I thought of was to remove the clustered primary key and add an auto-increment primary key plus a regular index on the <code>(recording_id, index)</code> columns. In my opinion this would help make the inserts "append-only". I'm open to any and all suggestions, thanks in advance!</p> <p>Edit: I'll address some of the points and questions raised in the comments and answers:</p> <ul> <li><code>autocommit</code> is set to <code>ON</code></li> <li>The value of <code>innodb_buffer_pool_size</code> is <code>21474836480</code>, and the value of <code>innodb_buffer_pool_chunk_size</code> is <code>134217728</code></li> <li>One comment raised concerns about contention between the shared lock used for reads and the exclusive lock used for writes. The table is used somewhat like a cache: I don't need reads to always reflect the latest state of the table if that means a performance gain. However, the table should remain durable even in the event of server crashes and hardware failures. Could this be achieved with a more relaxed transaction isolation level?</li> <li>The schema can definitely be optimized; <code>recording_id</code> could be a 4-byte integer, <code>end_filetime</code> could be changed to an elapsed value, and <code>start_filetime</code> could probably also be smaller. I'm afraid these changes would just postpone the problem for a while, until the table grows enough to compensate for the space saved.</li> <li>Inserts into the table are always sequential.</li> <li>A SELECT performed on the table looks like this:</li> </ul> <pre class="brush:php;toolbar:false;">SELECT TRUE FROM saved_segment WHERE recording_id = ?
AND `index` = ?</pre> <pre class="brush:php;toolbar:false;">SELECT `index`, start_filetime, end_filetime, offset_and_size, storage_id
FROM saved_segment
WHERE recording_id = ?
  AND start_filetime >= ?
  AND start_filetime <= ?
ORDER BY `index` ASC</pre> <p>The second type of query could certainly be improved with an index, but I'm concerned that this would further degrade INSERT performance.</p> <p>Another thing I forgot to mention: a very similar table exists. It is queried and inserted into in exactly the same way, which may be contributing to the IO starvation.</p> <p>Edit 2: <code>SHOW TABLE STATUS</code> output for the table <code>saved_segment</code> and the very similar table <code>saved_screenshot</code> (which has an additional index on a <code>bigint unsigned NOT NULL</code> column):</p> <table class="s-table"> <thead> <tr> <th>Name</th> <th>Engine</th> <th>Version</th> <th>Row_format</th> <th>Rows</th> <th>Avg_row_length</th> <th>Data_length</th> <th>Max_data_length</th> <th>Index_length</th> <th>Data_free</th> <th>Auto_increment</th> <th>Create_time</th> <th>Update_time</th> <th>Check_time</th> <th>Collation</th> <th>Checksum</th> <th>Create_options</th> <th>Comment</th> </tr> </thead> <tbody> <tr> <td>saved_screenshot</td> <td>InnoDB</td> <td>10</td> <td>Dynamic</td> <td>483430208</td> <td>61</td> <td>29780606976</td> <td>0</td> <td>21380464640</td> <td>6291456</td> <td>NULL</td> <td>2021-10-21 01:03:21</td> <td>2022-11-07 16:51:45</td> <td>NULL</td> <td>utf8mb4_0900_ai_ci</td> <td>NULL</td> <td></td> <td></td> </tr> <tr> <td>saved_segment</td> <td>InnoDB</td> <td>10</td> <td>Dynamic</td> <td>281861164</td> <td>73</td> <td>20802699264</td> <td>0</td> <td>0</td> <td>4194304</td> <td>NULL</td> <td>2022-11-02 09:03:05</td> <td>2022-11-07 16:51:22</td> <td>NULL</td> <td>utf8mb4_0900_ai_ci</td> <td>NULL</td> <td></td> <td></td> </tr> </tbody> </table>
Replies (1)
P粉022140576

I'll go out on a limb with this answer.

Assumptions

  • innodb_buffer_pool_size has a value slightly less than 20GB, and
  • the 1K SELECTs per second hit random parts of the table, then

The system has recently become I/O-bound, because the "next" block needed for the next SELECT is more and more often not cached in the buffer_pool.

The simple solution is to get more RAM and increase that tunable. But the table will only grow until it hits the next limit you purchase.
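As a rough sanity check on the I/O-bound assumption, here is the arithmetic using the figures from the question's SHOW TABLE STATUS output (a back-of-the-envelope sketch; real caching behavior depends on the access pattern, and the "working set" here naively counts both tables in full):

```python
# Compare the combined on-disk size of both tables (Data_length + Index_length)
# against innodb_buffer_pool_size, using the numbers from the question.

buffer_pool = 21_474_836_480  # innodb_buffer_pool_size (~20 GiB)

# From SHOW TABLE STATUS: Data_length + Index_length per table.
saved_segment = 20_802_699_264 + 0                # PK only
saved_screenshot = 29_780_606_976 + 21_380_464_640  # data + secondary index

working_set = saved_segment + saved_screenshot

print(f"working set : {working_set / 2**30:.1f} GiB")
print(f"buffer pool : {buffer_pool / 2**30:.1f} GiB")
print(f"fits in RAM : {working_set <= buffer_pool}")
```

The two tables together are roughly three times the size of the buffer pool, so random-point SELECTs at 1K/s will keep evicting pages that the inserts (and other reads) need.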

Instead, here are some partial solutions.

  • If the numbers are not too large, the first two columns could be INT UNSIGNED (4 bytes instead of 8), or even MEDIUMINT UNSIGNED (3 bytes). Note that ALTER TABLE will lock the table for a long time.
  • These start and end times look like timestamps with fractional seconds that are always ".000". DATETIME and TIMESTAMP take 5 bytes (instead of 8).
  • Your example shows an elapsed time of 0. If (end - start) is typically very small, storing the elapsed time instead of the end time would shrink the data further. (But using an end time can make things confusing.)
  • The sample data you provided looks "continuous". This is about as efficient as auto-increment. Is this the norm? If not, the INSERT may be part of the I/O thrashing.
  • You suggested adding an auto-increment (AI) plus a secondary index; that doubles the work of each insert, so I don't recommend it.
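As a sketch of the datatype suggestions above (the target types are assumptions; verify your actual value ranges first, and note that the filetime columns would need a one-time conversion from their 100-ns integer representation, and that this ALTER rewrites and locks the whole table):

```sql
-- Hypothetical shrink of the wide columns; run only after confirming that
-- recording_id and `index` fit in INT UNSIGNED and the times fit DATETIME.
ALTER TABLE saved_segment
    MODIFY `recording_id`   INT UNSIGNED NOT NULL,  -- 4 bytes instead of 8
    MODIFY `index`          INT UNSIGNED NOT NULL,  -- 4 bytes instead of 8
    MODIFY `start_filetime` DATETIME NOT NULL,      -- 5 bytes instead of 8
    MODIFY `end_filetime`   DATETIME NOT NULL;      -- or store a short duration
```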

More

Yes, that is the case.

Putting this as an INDEX, or better yet at the start of the PRIMARY KEY, will give you the best help with both of your queries:

(recording_id, index)

For:

SELECT  TRUE
FROM    saved_segment
WHERE   recording_id = ? AND `index` = ?

If it is used to control some other SQL, consider adding it to that other SQL:

... EXISTS ( SELECT 1
        FROM    saved_segment
        WHERE   recording_id = ? AND `index` = ? ) ...

This query (in either form) is already satisfied by what you have:

PRIMARY KEY(recording_id, index)

Your other query needs

INDEX(recording_id, start_filetime)

So, add that index, or ...

Better... this combination is better for both SELECTs:

PRIMARY KEY(recording_id, start_filetime, index).
INDEX(recording_id, index)

With this combination,

  • The single-row existence check will say "Using index" because the index is "covering".
  • The other query will find all relevant rows clustered together in the PK.
  • (The PK has these 3 columns because it needs to be unique. Having them in this order helps your second query. And since it's the PK, not just an INDEX, the query doesn't need to bounce between the index's BTree and the data's BTree.)
  • "Clustering" improves performance by reducing the number of disk blocks needed for such queries. This reduces "thrashing" in the buffer_pool, and hence the need for more RAM.
  • My index suggestions are mostly orthogonal to my data type suggestions.
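A sketch of applying the suggested key combination in one pass (one rewriting ALTER is cheaper than several; it will still lock a 280M-row table for a long time, so test it on a copy first):

```sql
-- Replace the PK and add the secondary index in a single table rebuild.
-- (recording_id, start_filetime, `index`) is unique because the old PK
-- (recording_id, `index`) is a subset of it.
ALTER TABLE saved_segment
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (recording_id, start_filetime, `index`),
    ADD INDEX (recording_id, `index`);
```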