Is full count query really that slow on a large MySQL InnoDB table?
P粉291886842
P粉291886842 2023-11-07 15:52:43
0
2
834

We have a large table with millions of entries. Full counting is very slow, see code below. Is this common for MySQL InnoDB tables? Isn't there a way to speed up this process? Even with query caching, it's still "slow". I'd also like to know why the count of the "communication" table with 2.8 mio entries is slower than the count of the "transaction" table with 4.5 mio entries.

I know it would be faster to use where clause. I'm just wondering if poor performance is normal.

We use Amazon RDS MySQL 5.7 and m4.xlarge (4 CPUs, 16 GB RAM, 500 GB storage). I've also tried larger instances with more CPU and RAM, but no big change in query times.

mysql> SELECT COUNT(*) FROM transaction;
+----------+
| COUNT(*) |
+----------+
|  4569880 |
+----------+
1 row in set (1 min 37.88 sec)

mysql> SELECT COUNT(*) FROM transaction;
+----------+
| count(*) |
+----------+
|  4569880 |
+----------+
1 row in set (1.44 sec)

mysql> SELECT COUNT(*) FROM communication;
+----------+
| count(*) |
+----------+
|  2821486 |
+----------+
1 row in set (2 min 19.28 sec)


P粉291886842
P粉291886842

reply all(2)
P粉401901266

This is supported using Multiple Version Concurrency Control (MVCC).

InnoDB allows your queries to be isolated within transactions without blocking other concurrent clients that are reading and writing data rows. These concurrent updates do not affect your transaction's view of the data.

But, what is the number of rows in the table, considering that many rows are being added or deleted while the count is being made? The answer is vague.

Your transaction should not be able to "see" row versions created after the transaction started. Likewise, your transaction should count rows even if someone else requests row deletion, but they do so after your transaction starts.

The answer is that when you perform a SELECT COUNT(*) or any other type of query that requires checking multiple rows, InnoDB must access each row to see the The current version is visible to the transaction view of the database and is counted while it is visible.

In tables that do not support transactions or concurrent updates (such as MyISAM), the storage engine retains the total number of rows as metadata for the table. The storage engine cannot support multiple threads updating rows simultaneously, so the row total is less ambiguous. So when you request SELECT COUNT(*) from a MyISAM table, it will only return the number of rows in memory (but this is of no use if you do a SELECT COUNT(*)) Use the WHERE clause by Some conditions count some subset of rows, so in this case it must actually count them).

In general, most people think that InnoDB's support for concurrent updates is well worth it, and they are willing to sacrifice the optimization of SELECT COUNT(*).

P粉356128676

Except what Bill said...

Minimum index

InnoDB selects the "smallest" index to perform COUNT(*). It is possible that the index of all communication is greater than the minimum index of transaction, so there is a time difference. When determining the size of an index, include the PRIMARY KEY column along with any secondary indexes:

PRIMARY KEY(id),   -- INT (4 bytes)
INDEX(flag),       -- TINYINT (1 byte)
INDEX(name),       -- VARCHAR(255) (? bytes)

To measure size, PRIMARY KEY is large because it contains (due to clustering) all columns of the table. INDEX(flag) is "5 bytes". INDEX(name) On average there may be dozens of bytes. SELECT COUNT(*) will explicitly select INDEX(flag).

Apparently Transactions have a "small" index, but Communications do not.

TEXT/BLOG Columns are sometimes stored "unlogged". Therefore, they are not included in the size of the PK index.

Query cache

If "Query Cache" is turned on, the second run of the query may be much faster than the first. But this only happens if the table has not changed. QC is rarely useful in production systems because any change to a table invalidates all QC entries for that table. By "faster" I mean about 0.001 seconds; not 1.44 seconds.

The difference between 1m38s and 1.44s may be due to what is cached in buffer_pool (InnoDB's general cache area). The first run probably couldn't find any "minimal" index in RAM, so it did a lot of I/O, taking 98 seconds to get all 4.5 million rows for that index. The second run found that all the data was cached in the buffer_pool, so it ran at CPU speed (no I/O), so it was much faster.

good enough

In this case, I fundamentally question the need to do COUNT(*). Note how you say "2.8 mio entry" as if 2 significant figures is "good enough". Wouldn't that be "good enough" if you showed the count to the user on the UI? If so, one solution for performance would be to count it once a day and store it somewhere. This will allow immediate access to "good enough" values.

There are other technologies. One is to use activity tags or some form of summary table to keep the counters updated.

Throw hardware at it

You've found that changing the hardware doesn't help.

  • 98 runs as fast as any of RDS's I/O products.
  • 1.44s runs as fast as any RDS CPU.
  • MySQL (and its variants) does not use multiple CPUs per query.
  • You have enough RAM so the entire "small" index can be put into the buffer_pool until your second SELECT COUNT(*).. (Too little RAM will cause the second runs very slowly) .)
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template