I'm trying to optimize a SQL query, but I'd like to know how to do it correctly.
SELECT r0_.* FROM ride r0_ use index (ride_booking_id_IDX) LEFT JOIN booking b1_ ON r0_.booking_id = b1_.id LEFT JOIN spot s2_ ON r0_.from_spot_id = s2_.id LEFT JOIN spot s3_ ON r0_.to_spot_id = s3_.id WHERE b1_.start_at <= '2023-04-21' AND b1_.end_at >= '2023-04-20' AND b1_.paid_at IS NOT NULL AND b1_.cancelled_at IS NULL AND ((s2_.zone_id = 1 OR s3_.zone_id = 1)) AND s2_.type = 'parking';
Here, I force the index (booking_id, from_spot_id, to_spot_id)
, which causes the query to execute within 25 seconds of the nearest date for about 100 milliseconds!
booking
table has about 2 million rows, and the ride
table has about 5 million rows.
However, I can see it scanning more rows using the forced index:
id | Choose a type | surface | Partition | type | Possible keys | key | key_len | refer to | OK | Filtered | additional |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Simple | b1_ | scope | Main, booking_id_end_IDX, booking_id_IDX, booking_id_start_IDX, IDX_E00CEDDEB75363F7, IDX_E00CEDDE37D3107C, IDX_E00CEDDEDEA4208C, booking_paid_at_IDX, booking_cancelled_at_IDX | IDX_E00CEDDE37D3107C | 6 | 111456 | 6.6 | Use index conditions; use location | ||
1 | Simple | r0_ | refer to | ride_booking_id_IDX | ride_booking_id_IDX | 109 | ector.b1_.id | 1 | 100.0 | ||
1 | Simple | s2_ | eq_ref | main,IDX_B9327A739F2C3FAB,spot_type_IDX | main | 4 | ector.r0_.from_spot_id | 1 | 72.52 | place of use | |
1 | Simple | s3_ | eq_ref | main | main | 4 | ector.r0_.to_spot_id | 1 | 100.0 | place of use |
Compared to the same query without using indexes:
id | Choose a type | surface | Partition | type | Possible keys | key | key_len | refer to | OK | Filtered | additional |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Simple | s2_ | refer to | main,IDX_B9327A739F2C3FAB,spot_type_IDX | spot_type_IDX | 767 | constant | 161 | 100.0 | Use index conditions | |
1 | Simple | r0_ | refer to | IDX_9B3D7CD0ABAF30D3, IDX_9B3D7CD03301C60, ride_booking_id_IDX, booking_from_spot_to_spot_IDX | IDX_9B3D7CD0ABAF30D3 | 5 | ector.s2_.id | 392 | 100.0 | ||
1 | Simple | b1_ | eq_ref | Main, booking_id_end_IDX, booking_id_IDX, booking_id_start_IDX, IDX_E00CEDDEB75363F7, IDX_E00CEDDE37D3107C, IDX_E00CEDDEDEA4208C, booking_paid_at_IDX, booking_cancelled_at_IDX | main | 108 | ector.r0_.booking_id | 1 | 5.0 | place of use | |
1 | Simple | s3_ | eq_ref | main | main | 4 | ector.r0_.to_spot_id | 1 | 100.0 | place of use |
As far as I can tell, the dates I'm using to compare against start_at
and end_at
are the reason why the query is noticeably faster.
So I try to isolate the slowest parts to smaller queries:
Select * from booking b where b.start_at < '2021-01-01' 和 b.end_at > '2021-01-01';
On the table booking I have two indexes (start_at)
and (end_at)
which help this query run faster as you approach the max and min values Faster (since the index will filter most rows, leaving very few rows to scan).
However, when I take a random value far enough in the past, it becomes much slower. The above query takes 10 seconds to run because it only uses one of the two indexes as expected and I don't know why the explanation for merge_index doesn't show up on such a simple query:
id | Choose a type | surface | Partition | type | Possible keys | key | key_len | refer to | OK | Filtered | additional |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Simple | b | scope | IDX_E00CEDDEB75363F7,IDX_E00CEDDE37D3107C | IDX_E00CEDDEB75363F7 | 6 | 1147319 | 50 | Use index conditions; use location |
Since I can't have an index that satisfies both range conditions, I tried splitting the query in half
SELECT * from booking b INNER JOIN booking b2 use index(booking_id_start_IDX) ON b.id = b2.id and b2.start_at < '2021-01-01' INNER JOIN booking b3 use index(booking_id_end_IDX) ON b.id = b3.id and b3.end_at > '2021-01-01';
This query runs significantly faster, taking approximately 600 milliseconds. However, due to the simplicity of the query and the fact that it returns about 7k rows, I would expect it to be in the double digits at best.
I don't understand why the query doesn't automatically select my indexes (id, start_at)
and (id, end_at)
? What am I missing?
I know I could partition the table to get better results, but I have foreign keys that cannot be deleted, so that's not a solution. Should I consider another schema and have a table that holds the reservation dates separately without any foreign keys and have the reservations table reference it so I can partition the reservations table? Is it possible to use foreign keys to reference a partitioned subscription table at subscription time?
The Mysql engine is running in AWS with the following version: 8.0.mysql_aurora.3.02.2
The output of SELECT @@optimizer_switch
is:
index_merge=on, index_merge_union=on, index_merge_sort_union=on, index_merge_intersection=on, engine_condition_pushdown=on, index_condition_pushdown=on, mrr=on, mrr_cost_based=on, block_nested_loop=on, batched_key_access=off, materialization=on, semijoin=on, loosescan=on, firstmatch=on, duplicateweedout=on, subquery_materialization_cost_based=on, use_index_extensions=on, condition_fanout_filter=on, provided_merge=on, use_invisible_indexes=off, skip_scan=on, hash_join=on, subquery_to_derivative=off , prefer_ordering_index=on, hypergraph_optimizer=off, derivative_condition_pushdown=on
Your index
(id, start_at)
is not selected because there is no fixedid
to search for.Depending on your use case, you may want to create one index on
start_at
and another onend_at
. Afterwards, a simple querySELECT * from booking b where b.start_at '2021-01-01';
will take effect immediately; depending on the search criteria, MySQL may use a MERGE INDEX optimization operation index or both.If you wish to use a single index, you need to choose the order of the fields carefully, since indexes are used in the same order in which they are defined.
Edit: After the OPs edit, here are my thoughts.
This explains things pretty well. Assuming you
SELECT *
, MySQL will be forced to read the entire table. Try selecting onlyid
, in which case it will most likely use the index; as it will be an index-covered query.