PostgreSQL random row selection method
Traditional random row selection methods are inefficient and slow when dealing with large tables containing millions or even billions of records. Two common methods are:
Use random() to filter:
<code class="language-sql">select * from big where random() < 0.001;</code>
Use order by random() and limit:
<code class="language-sql">select * from big order by random() limit 1000;</code>
However, due to the need for a full table scan or sorting, these methods are not the best choice for tables with a large number of rows and will cause performance bottlenecks.
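To see this cost on your own data, you can ask PostgreSQL for the query plan. The statement below is a minimal sketch that assumes the table is named big, as in the optimized query later in this article. On a large table the plan would be expected to show a sequential scan feeding a sort, though the exact plan depends on your PostgreSQL version and statistics.

<code class="language-sql">-- Show the plan without running the query.
-- "big" is the example table name used in this article.
EXPLAIN
SELECT *
FROM   big
ORDER  BY random()
LIMIT  1000;</code>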
Optimization methods for large tables
For tables that have an indexed numeric ID column with few or no gaps (for example, a serial primary key), consider the following method, which is significantly faster:
Query:
<code class="language-sql">WITH params AS (
   SELECT 1       AS min_id,   -- optional: custom lower bound for the ID range
          5100000 AS id_span   -- approximate ID range (max ID - min ID + buffer)
)
SELECT *
FROM  (
   SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
   FROM   params p, generate_series(1, 1100) g   -- 1000 rows wanted, plus surplus
   GROUP  BY 1                                   -- remove duplicate IDs
) r
INNER JOIN big ON r.id = big.id
LIMIT 1000;                                      -- trim the surplus</code>
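How it works:
ID range estimate: the params CTE holds the lowest ID and an approximate span of the ID space (maximum ID minus minimum ID, plus a buffer).
Random ID generation: generate_series produces 1,100 rows, and each one is turned into a random integer within that ID range.
Redundancy and duplication elimination: more IDs are generated than the 1,000 rows required, and duplicate IDs are collapsed before the join.
Table joins and restrictions: the random IDs are joined to big, which silently drops IDs that do not exist in the table, and LIMIT 1000 trims the surplus.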
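The min_id and id_span values in the query are typed in by hand, so they have to be measured first. The statements below are one possible way to obtain them, assuming the table is named big and its ID column is named id and is indexed. The column aliases are illustrative only.

<code class="language-sql">-- Exact bounds of the ID range; with an index on id,
-- min() and max() can usually be answered from the index.
SELECT min(id)           AS min_id,
       max(id)           AS max_id,
       max(id) - min(id) AS id_span
FROM   big;

-- Cheap row-count estimate from the planner statistics
-- (only as fresh as the last ANALYZE or autovacuum run).
SELECT reltuples::bigint AS estimated_rows
FROM   pg_class
WHERE  oid = 'big'::regclass;</code>

Comparing estimated_rows with id_span gives a rough idea of how many gaps the ID column has, and therefore how much surplus to generate.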
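Why it’s fast:
Minimal index usage: provided big.id is indexed, the query performs roughly 1,100 index lookups instead of scanning or sorting the whole table.
Optimized random number generation: random() is called once per candidate ID (1,100 times), not once per table row.
Redundancy and duplication elimination: the small surplus makes it likely that 1,000 rows remain after duplicates and missing IDs are dropped, so a single pass is usually enough.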
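The index lookups only help if an index on the ID column actually exists. A quick check, assuming the table is named big, is to list its indexes from the pg_indexes system view:

<code class="language-sql">-- List all indexes defined on the example table "big".
-- Look for one whose definition covers the id column.
SELECT indexname, indexdef
FROM   pg_indexes
WHERE  tablename = 'big';</code>

If id is the primary key, a matching unique index should appear in the result.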
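Other options:
Recursive CTE to handle gaps: if the ID column has many gaps, a single pass may return fewer rows than requested; a recursive CTE can keep drawing random IDs until enough rows are found.
Function wrappers for reuse: the query can be wrapped in a function that takes the desired number of rows as a parameter.
Universal functions for any table: a generic version can take the table as a parameter and build the query with dynamic SQL.
Materialized views for speed: if the same random sample can be reused for a while, it can be stored in a materialized view and refreshed when a new sample is needed.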
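As an illustration of the recursive CTE option, here is a sketch under stated assumptions: the table is big, IDs start at 1, the ID span is about 5,100,000, and every column of big has a type that UNION can compare. The name random_pick and the surplus figures (1030 and 999) are illustrative choices, not fixed values. The query relies on PostgreSQL evaluating only as many rows of the CTE as the outer LIMIT fetches, so treat it as a starting point rather than a finished implementation.

<code class="language-sql">WITH RECURSIVE random_pick AS (
   -- First pass: draw slightly more than 1000 random IDs.
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::integer AS id
      FROM   generate_series(1, 1030)
      LIMIT  1030
   ) r
   JOIN   big b ON b.id = r.id    -- IDs that hit a gap are dropped

   UNION                          -- duplicate rows are dropped

   -- Further passes: draw one new random ID per row found so far.
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::integer AS id
      FROM   random_pick
      LIMIT  999
   ) r
   JOIN   big b ON b.id = r.id
)
SELECT *
FROM   random_pick
LIMIT  1000;                      -- stop as soon as 1000 rows are available</code>

Because the outer LIMIT ends the recursion, the query would keep iterating if the table holds fewer than 1,000 rows in the ID range, so adjust the numbers to your data.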
TABLESAMPLE in PostgreSQL 9.5 and later: the TABLESAMPLE SYSTEM clause provides a faster but less random row sampling method. It takes a percentage of the table, so the number of rows returned is only approximate; the additional module tsm_system_rows provides SYSTEM_ROWS, which returns an exact number of rows. Keep in mind that the sample may not be completely random due to clustering effects, because rows are picked block by block.
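A short sketch of both sampling variants follows, assuming the table is named big and holds roughly 5,100,000 rows (the figure used in the query above). TABLESAMPLE SYSTEM takes a percentage, while SYSTEM_ROWS comes from the tsm_system_rows contrib module, which has to be installed in the database first.

<code class="language-sql">-- Roughly 1000 of about 5,100,000 rows; the argument is a percentage,
-- so the number of rows returned will vary from run to run.
SELECT *
FROM   big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);

-- Exact row count, using the tsm_system_rows contrib module.
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

SELECT *
FROM   big TABLESAMPLE SYSTEM_ROWS(1000);</code>

Both variants pick whole blocks of the table, so rows that are stored next to each other tend to appear together in the sample.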