检索每组中的最后一条记录 - MySQL

Question

有包含数据的表消息，如下所示：IdNameOther_Columns------------------------1AA_data_12AA_data_23AA_data_34BB_data_15BB_data_26CC_data_1IfIrunaqueryselect*frommessagesgroupbyname,Iwi

P粉786432579 · Answer

UPD：2017-03-31，版本5.7.5 MySQL 默认启用 ONLY_FULL_GROUP_BY 开关（因此，非确定性 GROUP BY 查询被禁用）。此外，他们更新了 GROUP BY 实现，即使禁用了开关，该解决方案也可能无法按预期工作。需要检查一下。

Bill Karwin's solution above works fine when item count within groups is rather small, but the performance of the query becomes bad when the groups are rather large, since the solution requires about n*n/2 + n/2 of only IS NULL comparisons.

I made my tests on a InnoDB table of 18684446 rows with 1182 groups. The table contains testresults for functional tests and has the (test_id, request_id) as the primary key. Thus, test_id is a group and I was searching for the last request_id for each test_id.

Bill's solution has already been running for several hours on my dell e4310 and I do not know when it is going to finish even though it operates on a coverage index (hence using index in EXPLAIN).

我有几个基于相同想法的其他解决方案：

if the underlying index is BTREE index (which is usually the case), the largest (group_id, item_value) pair is the last value within each group_id, that is the first for each group_id if we walk through the index in descending order;
如果我们读取索引覆盖的值，则按照索引的顺序读取值；
每个索引都隐式包含附加到该索引的主键列（即主键位于覆盖索引中）。在下面的解决方案中，我直接对主键进行操作，在您的情况下，您只需要在结果中添加主键列即可。
在许多情况下，在子查询中按所需的顺序收集所需的行 ID 并将子查询的结果连接到 ID 上要便宜得多。由于对于子查询结果中的每一行，MySQL 将需要基于主键进行一次获取，因此子查询将首先放在连接中，并且行将按照子查询中 id 的顺序输出（如果我们省略显式 ORDER BY用于连接）

MySQL 使用索引的 3 种方式是一篇很棒的文章，可以帮助您了解一些细节。

解决方案1

这个速度快得令人难以置信，在我的 18M+ 行上大约需要 0.8 秒：

SELECT test_id, MAX(request_id) AS request_id
FROM testresults
GROUP BY test_id DESC;

如果您想将顺序更改为 ASC，请将其放入子查询中，仅返回 ids 并将其用作子查询来连接其余列：

SELECT test_id, request_id
FROM (
    SELECT test_id, MAX(request_id) AS request_id
    FROM testresults
    GROUP BY test_id DESC) as ids
ORDER BY test_id;

这对我的数据来说大约需要 1.2 秒。

解决方案2

这是另一个解决方案，对于我的表来说大约需要 19 秒：

SELECT test_id, request_id
FROM testresults, (SELECT @group:=NULL) as init
WHERE IF(IFNULL(@group, -1)=@group:=test_id, 0, 1)
ORDER BY test_id DESC, request_id DESC

它也按降序返回测试。它要慢得多，因为它执行完整索引扫描，但它可以让您了解如何为每个组输出 N 个最大行。

该查询的缺点是查询缓存无法缓存其结果。

P粉848442185 · Answer

MySQL 8.0 现在支持窗口函数，例如几乎所有流行的 SQL 实现。使用这个标准语法，我们可以编写每组最大n个查询：

WITH ranked_messages AS (
  SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
  FROM messages AS m
)
SELECT * FROM ranked_messages WHERE rn = 1;

此方法和其他查找的方法分组最大行数在 MySQL 手册中进行了说明。

以下是我在2009年针对这个问题写的原始答案：

我这样写解决方案：

SELECT m1.*
FROM messages m1 LEFT JOIN messages m2
 ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;

关于性能，一种解决方案可能会更好，具体取决于数据的性质。因此，您应该测试这两个查询，并根据您的数据库使用性能更好的查询。

For example, I have a copy of the StackOverflow August data dump. I'll use that for benchmarking. There are 1,114,357 rows in the Posts table. This is running on MySQL 5.0.75 on my Macbook Pro 2.40GHz.

我将编写一个查询来查找给定用户 ID（我的）的最新帖子。

First using the technique shown by @Eric with the GROUP BY in a subquery:

SELECT p1.postid
FROM Posts p1
INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
            FROM Posts pi GROUP BY pi.owneruserid) p2
  ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;

1 row in set (1 min 17.89 sec)

Even the EXPLAIN analysis takes over 16 seconds:

+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table      | type   | possible_keys              | key         | key_len | ref          | rows    | Extra       |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
|  1 | PRIMARY     |  | ALL    | NULL                       | NULL        | NULL    | NULL         |   76756 |             | 
|  1 | PRIMARY     | p1         | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY     | 8       | p2.maxpostid |       1 | Using where | 
|  2 | DERIVED     | pi         | index  | NULL                       | OwnerUserId | 8       | NULL         | 1151268 | Using index | 
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)

Now produce the same query result using my technique with LEFT JOIN:

SELECT p1.postid
FROM Posts p1 LEFT JOIN posts p2
  ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;

1 row in set (0.28 sec)

The EXPLAIN analysis shows that both tables are able to use their indexes:

+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys              | key         | key_len | ref   | rows | Extra                                |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
|  1 | SIMPLE      | p1    | ref  | OwnerUserId                | OwnerUserId | 8       | const | 1384 | Using index                          | 
|  1 | SIMPLE      | p2    | ref  | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8       | const | 1384 | Using where; Using index; Not exists | 
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)

Here's the DDL for my Posts table:

CREATE TABLE `posts` (
  `PostId` bigint(20) unsigned NOT NULL auto_increment,
  `PostTypeId` bigint(20) unsigned NOT NULL,
  `AcceptedAnswerId` bigint(20) unsigned default NULL,
  `ParentId` bigint(20) unsigned default NULL,
  `CreationDate` datetime NOT NULL,
  `Score` int(11) NOT NULL default '0',
  `ViewCount` int(11) NOT NULL default '0',
  `Body` text NOT NULL,
  `OwnerUserId` bigint(20) unsigned NOT NULL,
  `OwnerDisplayName` varchar(40) default NULL,
  `LastEditorUserId` bigint(20) unsigned default NULL,
  `LastEditDate` datetime default NULL,
  `LastActivityDate` datetime default NULL,
  `Title` varchar(250) NOT NULL default '',
  `Tags` varchar(150) NOT NULL default '',
  `AnswerCount` int(11) NOT NULL default '0',
  `CommentCount` int(11) NOT NULL default '0',
  `FavoriteCount` int(11) NOT NULL default '0',
  `ClosedDate` datetime default NULL,
  PRIMARY KEY  (`PostId`),
  UNIQUE KEY `PostId` (`PostId`),
  KEY `PostTypeId` (`PostTypeId`),
  KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
  KEY `OwnerUserId` (`OwnerUserId`),
  KEY `LastEditorUserId` (`LastEditorUserId`),
  KEY `ParentId` (`ParentId`),
  CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;

评论者请注意：如果您想要使用不同版本的 MySQL、不同的数据集或不同的表设计进行另一个基准测试，请随意自己做。我已经展示了上面的技术。 Stack Overflow 在这里向您展示如何进行软件开发工作，而不是为您完成所有工作。