Retrieve last record in each group - MySQL

Question

Thereisatablemessagesthatcontainsdataasshownbelow:IdNameOther_Columns-------------------------1AA_data_12AA_data_23AA_data_34BB_data_15BB_data_26CC_data_1IfIrunaqueryselect*frommessagesgroupbyname,Iwi

P粉786432579 · Answer

UPD: 2017-03-31, Version 5.7.5 MySQL enables the ONLY_FULL_GROUP_BY switch by default (so non-deterministic GROUP BY queries are disabled). Additionally, they updated the GROUP BY implementation and the solution may not work as expected even with the switch disabled. Need to check it out.

Bill Karwin's solution above works fine when item count within groups is rather small, but the performance of the query becomes bad when the groups are rather large, since the solution requires about n*n/2 n/2 of only IS NULL comparisons.

I made my tests on a InnoDB table of 18684446 rows with 1182 groups. The table contains testresults for functional tests and has the (test_id, request_id) as the primary key. Thus, test_id is a group and I was searching for the last request_id for each test_id.

Bill's solution has already been running for several hours on my dell e4310 and I do not know when it is going to finish even though it operates on a coverage index (hence using index in EXPLAIN).

I have a couple of other solutions based on the same idea:

if the underlying index is BTREE index (which is usually the case), the largest (group_id, item_value) pair is the last value within each group_id, that is the first for each group_id if we walk through the index in descending order;
If we read the value covered by the index, the values are read in the order of the index;
Each index implicitly contains the primary key columns attached to the index (i.e. the primary key is in a covering index). In the solution below I operate directly on the primary key, in your case you just need to add the primary key column to the result.
In many cases it is much cheaper to collect the required row IDs in the desired order in a subquery and concatenate the results of the subquery to the IDs. Since for each row in the subquery result, MySQL will need to do a fetch based on the primary key, the subquery will be put in the join first, and the rows will be output in the order of the id in the subquery (if we omit the explicit ORDER BY for the join )

3 Ways MySQL Uses Indexes is a great article to help you understand some of the details.

Solution 1

This is incredibly fast, taking about 0.8 seconds on my 18M rows:

SELECT test_id, MAX(request_id) AS request_id
FROM testresults
GROUP BY test_id DESC;

If you want to change the order to ASC, put it in a subquery that returns only the ids and use it as a subquery to join the rest of the columns:

SELECT test_id, request_id
FROM (
    SELECT test_id, MAX(request_id) AS request_id
    FROM testresults
    GROUP BY test_id DESC) as ids
ORDER BY test_id;

This takes about 1.2 seconds for my data.

Solution 2

Here's another solution that took about 19 seconds for my table:

SELECT test_id, request_id
FROM testresults, (SELECT @group:=NULL) as init
WHERE IF(IFNULL(@group, -1)=@group:=test_id, 0, 1)
ORDER BY test_id DESC, request_id DESC

It also returns tests in descending order. It's much slower because it performs a full index scan, but it gives you an idea of how to output the N maximum rows for each group.

The disadvantage of this query is that the query cache cannot cache its results.

P粉848442185 · Answer

MySQL 8.0 now supports window functions, such as almost all popular SQL implementations. Using this standard syntax, we can write up to n queries per group:

WITH ranked_messages AS (
  SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
  FROM messages AS m
)
SELECT * FROM ranked_messages WHERE rn = 1;

This method and other methods of finding the maximum number of rows grouped are described in the MySQL manual.

The following is the original answer I wrote to this question in 2009:

I wrote the solution like this:

SELECT m1.*
FROM messages m1 LEFT JOIN messages m2
 ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;

Regarding performance, one solution may be better depending on the nature of the data. Therefore, you should test both queries and use the one with better performance based on your database.

For example, I have a copy of the StackOverflow August data dump. I'll use that for benchmarking. There are 1,114,357 rows in the Posts table. This is running on MySQL 5.0.75 on my Macbook Pro 2.40GHz.

I will write a query to find the latest posts for a given user ID (mine).

First using the technique shown by @Eric with the GROUP BY in a subquery:

SELECT p1.postid
FROM Posts p1
INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
            FROM Posts pi GROUP BY pi.owneruserid) p2
  ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;

1 row in set (1 min 17.89 sec)

Even the EXPLAIN analysis takes over 16 seconds:

+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table      | type   | possible_keys              | key         | key_len | ref          | rows    | Extra       |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
|  1 | PRIMARY     |  | ALL    | NULL                       | NULL        | NULL    | NULL         |   76756 |             | 
|  1 | PRIMARY     | p1         | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY     | 8       | p2.maxpostid |       1 | Using where | 
|  2 | DERIVED     | pi         | index  | NULL                       | OwnerUserId | 8       | NULL         | 1151268 | Using index | 
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)

Now produce the same query result using my technique with LEFT JOIN:

SELECT p1.postid
FROM Posts p1 LEFT JOIN posts p2
  ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;

1 row in set (0.28 sec)

The EXPLAIN analysis shows that both tables are able to use their indexes:

+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys              | key         | key_len | ref   | rows | Extra                                |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
|  1 | SIMPLE      | p1    | ref  | OwnerUserId                | OwnerUserId | 8       | const | 1384 | Using index                          | 
|  1 | SIMPLE      | p2    | ref  | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8       | const | 1384 | Using where; Using index; Not exists | 
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)

Here's the DDL for my Posts table:

CREATE TABLE `posts` (
  `PostId` bigint(20) unsigned NOT NULL auto_increment,
  `PostTypeId` bigint(20) unsigned NOT NULL,
  `AcceptedAnswerId` bigint(20) unsigned default NULL,
  `ParentId` bigint(20) unsigned default NULL,
  `CreationDate` datetime NOT NULL,
  `Score` int(11) NOT NULL default '0',
  `ViewCount` int(11) NOT NULL default '0',
  `Body` text NOT NULL,
  `OwnerUserId` bigint(20) unsigned NOT NULL,
  `OwnerDisplayName` varchar(40) default NULL,
  `LastEditorUserId` bigint(20) unsigned default NULL,
  `LastEditDate` datetime default NULL,
  `LastActivityDate` datetime default NULL,
  `Title` varchar(250) NOT NULL default '',
  `Tags` varchar(150) NOT NULL default '',
  `AnswerCount` int(11) NOT NULL default '0',
  `CommentCount` int(11) NOT NULL default '0',
  `FavoriteCount` int(11) NOT NULL default '0',
  `ClosedDate` datetime default NULL,
  PRIMARY KEY  (`PostId`),
  UNIQUE KEY `PostId` (`PostId`),
  KEY `PostTypeId` (`PostTypeId`),
  KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
  KEY `OwnerUserId` (`OwnerUserId`),
  KEY `LastEditorUserId` (`LastEditorUserId`),
  KEY `ParentId` (`ParentId`),
  CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;

Note to commenters: If you want to run another benchmark using a different version of MySQL, a different dataset, or a different table design, feel free to do it yourself. I've demonstrated the technique above. Stack Overflow is here to show you how to do software development work, not to do all the work for you.