How to improve grouping speed
P粉155832941
2023-09-05 10:24:40
<p>I have a large table with over 3B rows and my query now takes 3 days to execute. </p>
<p>This is my query: </p>
<pre class="brush:php;toolbar:false;">insert into frequencywords (word, frequency, filename)
select word, count(*) as frequency, max(filename)
from allwordstemp
group by word</pre>
<p>Basically, my query is to group by word from allwordstemp table, and I want to know the <strong>filename</strong> when frequency = 1, that's why I added max(filename), Because filename needs to be included in the aggregate function, such as max. If frequency > 1, I don't need the value of filename either.
Both tables have 2 indexes on word and filename. </p>
<p>allwordstemp table (filename is an id file): </p>
<pre class="brush:php;toolbar:false;">CREATE TABLE `allwordstemp` (
`word` varchar(45) DEFAULT NULL,
`filename` int(11) DEFAULT NULL,
KEY `idx_allwordstemp_word` (`word`),
KEY `idx_allwordstemp_filename` (`filename`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;</pre>
<p>frequencywords table:</p>
<pre class="brush:php;toolbar:false;">CREATE TABLE `frequencywords` (
`word` varchar(45) DEFAULT NULL,
`frequency` int(11) DEFAULT NULL,
`filename` int(11) DEFAULT NULL,
KEY `idx_frequencywords_word` (`word`),
KEY `idx_frequencywords_frequency` (`frequency`),
KEY `idx_frequencywords_filename` (`filename`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;</pre>
<p>Explain selection:</p>
<pre class="brush:php;toolbar:false;"> ---- ------------- -------------- -- ---------- ------- --------------- ------------------ ----- --------- ------ ------------ ---------- -------- -----
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- ------------- -------------- ------------ ------- --------------- ----------------------- ---------- --- -------------------------------------
| 1 | SIMPLE | allwordstemp | NULL | index | NULL | idx_allwordstemp_word | 138 | NULL | 3487864881 | 100.00 | Using index |
---- ------------- -------------- ------------ ------- --------------- ----------------------- ---------- --- --- ---------------- ---------- ------------- </pre>
<p>How can I make my query faster? </p>
Your filter appears to be based on frequency, not words or filenames. So I would first index all the words by frequency.
Then, assuming frequency is an integer, I would add a WHERE clause like this:
The above will give you a list of all words that appear only once in various file names.
Hope this helps, and all the best!